Assessing Diagnostic Reasoning: A Consensus Statement Summarizing Theory, Practice, and Future Needs


  • Breakout session participants: Chandra Aubin, Kat Bailey, Jeremy Branzetti, Rob Cloutier, Eva Delgado, Frank Fernandez, Doug Franzen, Robert Furlong, David Gordon, Nikhil Goyal, Richard Gray, Nathan Haas, Danielle Hart, Emily Hayden, Corey Heitz, Sheryl Heron, Cherri Hobgood, Laura Hopson, Hans House, Sharhabeel Jwayyed, Sorabh Khandelwal, Paul Ko, Amy Kontrick, Richard Lammers, Katrina Leone, Michelle Lin, Kerry McCabe, Chris McDowell, Brian Nelson, Elliot Rodriguez, Nestor Rodriguez, Sally Santen, Tim Schaefer, Jeff Siegelman, Bill Soares, Susan Stern, Tom Swoboda, James Takayesu, Dave Wald, Clare Wallner, John Wightman, Adam Wilson, and Paul Zgurzynski.

  • This paper reports on a workshop session of the 2012 Academic Emergency Medicine consensus conference, “Education Research in Emergency Medicine: Opportunities, Challenges, and Strategies for Success,” May 9, 2012, Chicago, IL.

  • The authors have no relevant financial information or potential conflicts of interest to disclose.

Address for correspondence and reprints: Jonathan S. Ilgen, MD, MCR; e-mail:


Assessment of emergency physicians' (EPs') diagnostic reasoning skills is essential for effective training and patient safety. This article summarizes the findings of the diagnostic reasoning assessment track of the 2012 Academic Emergency Medicine consensus conference “Education Research in Emergency Medicine: Opportunities, Challenges, and Strategies for Success.” Existing theories of diagnostic reasoning, as they relate to emergency medicine (EM), are outlined. Existing strategies for the assessment of diagnostic reasoning are described. Based on a review of the literature, expert thematic analysis, and iterative consensus agreement during the conference, this article summarizes current assessment gaps and prioritizes future research questions concerning the assessment of diagnostic reasoning in EM.

Psychologists have been studying how people think for decades. Translation and application of these theories to medicine has accelerated in recent years in response to emerging themes of patient safety and competency-based education.[1-5] Emergency physicians (EPs) are challenged daily by the vast spectrum and acuity of clinical presentations they diagnose in a data-poor, rapidly evolving, decision-dense environment. Diagnostic uncertainty is a hallmark of emergency medicine (EM), and given these factors, it is perhaps not surprising that errors are made.[6-8] One retrospective study of patients evaluated by EPs reported a diagnostic error rate of 0.6%.[9] In contrast, 37% to 70% of malpractice claims allege physician negligence or diagnostic error,[6, 10] and in one study, 96% of missed emergency department (ED) diagnoses were attributed to cognitive factors.[8] However, the retrospective analyses used in all of these studies are prone to substantial hindsight bias, because reviewers draw on further clinical information and the evolution of patients' symptoms, neither of which was available at the time of ED diagnosis.[11]

In May 2012, Academic Emergency Medicine hosted a consensus conference entitled “Education Research in Emergency Medicine: Opportunities, Challenges, and Strategies for Success,” with the goals of defining research agendas that address the measurement gaps in EM education and building infrastructure for collaboration in these domains. This article reports on the findings of the diagnostic reasoning assessment breakout session. Through a qualitative process that included a review of the literature, expert thematic analysis, and iterative consensus agreement at the conference, current assessment gaps are summarized and future research questions concerning the assessment of diagnostic reasoning are prioritized.


Research in cognitive psychology has explored how individuals reason when solving problems.[12-14] Emerging theories suggest ways to understand the process of diagnostic reasoning in medicine.[15-18] One clear conclusion is that general problem-solving strategies cannot be effectively taught, learned, or applied.[19-21] Success on one type of problem does not predict success on another,[20, 22-24] nor does the quality of general reasoning processes appear to distinguish between experts and novices.[17] A classic study by Elstein et al.[20] demonstrated that experts have more knowledge than novices, and it is this increased knowledge that enables them to achieve a higher rate of diagnostic accuracy, rather than general problem solving skills. It is not only the amount of knowledge, but also the manner in which this knowledge is arranged in clinicians' memories, that facilitates accurate diagnostic reasoning.[25] Compared to novices, expert physicians are better able to access knowledge precisely because of their experience, while novices may be unable to connect existing knowledge to a “novel” clinical problem.[26, 27]

From didactic presentations, role modeling, case discussions, and clinical exposure, novices integrate networks of information, associative links, and memories of real patient encounters to form unique clusters of information for each diagnosis. Barrows and Feltovich[28] coined the term “illness scripts” for these complex collections of data. The illness script theory assumes that knowledge networks adapted to clinical tasks develop through experience and operate autonomously beneath the level of conscious awareness.[29, 30] Clinicians refine their unique collection of illness scripts based on real patient encounters, thereby forming idiosyncratic memories relating to a diagnosis.[27, 31, 32] Through experience, clinicians accumulate a vast “library” of patient presentations that can be rapidly and subconsciously accessed for the purpose of hypothesis generation and diagnostic decision-making.[18] This “pattern matching” is seen as the dominant mode of reasoning for most expert clinicians.[19, 33] These automatic reasoning processes, more recently labeled as System 1 thinking,[34-39] are nonanalytical, rapid, and require little cognitive effort.[39] In contrast, System 2 thinking is analytical, effortful, and employs a deductive search for a fit between the available information and appropriate scripts.[34-39] Novices employ this analytic mode of reasoning more frequently than their experienced counterparts because they lack the experience necessary for System 1 reasoning. However, while System 1 reasoning is a hallmark of the experienced physician, errors may result from an over-reliance on automatic reasoning.[40]

Most clinical scenarios require both systems. This combined approach, often referred to as “dual processing,” likely offers the best chance at diagnostic success, even for novices.[16, 17, 25] A series of studies in which undergraduate psychology students were taught to read electrocardiograms demonstrated improved performance when these subjects were given instructions to use both similarity (e.g., System 1) and feature identification (e.g., System 2) strategies for diagnosis, compared to use of either of these strategies alone.[41] It is possible that the combined use of automatic and analytic thinking is more beneficial for complex rather than simple cases[42] or when physicians anticipate difficulty.[43, 44]

Finally, diagnostic reasoning constructs must be considered in the context of a dynamic and decision-dense environment.[45] Studies suggest that EPs care for a median of six to seven patients simultaneously, and as many as 16.[46] The average time on particular tasks is limited to less than 2 minutes,[47] and interruptions occur every 2 to 10 minutes in the ED.[46-50] A study in a U.S. academic ED demonstrated that 42% of tasks were interrupted before completion.[46] In one study, when EPs were interrupted, they resumed the original suspended activity only after they performed one to eight additional activities.[51] Additionally, the high metabolic demand of analytical reasoning (i.e., System 2)[52, 53] is likely amplified in decision-dense environments. Based on cognitive load theory,[54] it is thus possible that current admonitions to “think carefully” in the ED environment (i.e., employ System 2) may, in fact, overwhelm working memory and be detrimental.

Assessing Diagnostic Reasoning

Assessment should not be considered in isolation from other integrated elements of a training program (learning objectives, instructional methods, etc.) or the reward structure inherent in continuing professional development activities.[55] Neither should a single instrument or testing format be regarded as sufficient for assessing diagnostic reasoning. The psychometric properties, feasibility, acceptability, and educational effect of any strategy are dependent on its context and application.[56] An assessment program that employs multiple integrated strategies will provide the most robust process for determining physician competence in diagnostic reasoning.[57]

Beyond the specific issues noted in the review of each tool below, the following general issues limit existing assessment formats:

  • Diagnostic reasoning must be inferred from behavior because it is not a discrete, measurable quality and is not independent of context and content. To achieve any degree of validity, inferences must sample over a number of knowledge domains. Inferences are imprecise. The theorized dual process of reasoning cannot be isolated from the context in which it functions, nor can the explicit use of System 1 or System 2 be measured in the clinical environment.[16, 17] Any decision-making task requires a mixture of both processes.[58]
  • Existing instruments that assess diagnostic reasoning emphasize System 2. System 1 reasoning is unconscious and cannot be explicitly articulated with trustworthy accuracy. The shift between automatic and analytic reasoning (which may in fact be performed in parallel) is impossible to directly observe or absolutely infer in clinical environments.[59]
  • Accuracy, and therefore assessment of diagnostic reasoning, is influenced by context specificity. Diagnostic accuracy is not stable across all of the clinical domains that inform EM, which mandates assessment across multiple patient problems. For example, the diagnostic accuracy of an EP confronted with a patient with chest pain does not necessarily correlate with his or her diagnostic accuracy for a patient with a vesicular exanthem.[57]
  • Expert assessors are influenced by their frame of reference: a rater's personal knowledge, experience, ability, and personal bias (regarding the importance of a specific element in case management) influence his or her adjudication of a learner's performance in a nonstandard fashion.[60-64]

Assessing Diagnostic Reasoning in the Extra-Clinical Setting

Assessment of diagnostic reasoning outside of the clinical setting allows for better standardization, improved resource efficiency, sampling across a broad variety of clinical pathologies, and improved reliability over clinically based assessments.[65] Common criticisms of these modalities include lack of authenticity[66] and general neglect of the process of information gathering[15] (an important factor in diagnostic reasoning).

Written Examinations

The most commonly used assessment of diagnostic reasoning is the multiple-choice question (MCQ) examination. MCQs form the bulk of the United States Medical Licensing Examination (USMLE) Steps 1, 2, and 3 and Part I of the Medical Council of Canada Qualifying Exam. Context-rich MCQs may be reasonable tests of decision-making in situations of certainty, where a particular answer has been determined to be most correct. MCQs can be produced with excellent psychometric qualities, can be easily administered, and in some ways are less resource-intensive than other assessment instruments.[65] A variety of multiple-choice examinations have been shown to have good predictive validity with respect to performance in practice as measured by peer assessment, a variety of clinical indicators including appropriate antibiotic prescribing and cardiac mortality, and patient complaints.[67-69]

A clear drawback to MCQs for the purposes of assessing diagnostic reasoning is that the list of predefined choices may cue responses. In clinical practice, diagnostic possibilities must be generated de novo by the practitioner, a crucial step in decision-making that may be driven by a preponderance of subconscious System 1 processes.[70, 71] Thus, the diagnostic hypothesis-generation stage is bypassed by the traditional MCQ format. MCQs likely have the most value in the context of diagnosis verification, a process that incorporates elements of both System 1 and System 2. If there is a correlation between hypothesis generation and diagnosis verification, then one could potentially infer performance in one from performance in the other.

One way to evaluate de novo hypothesis generation is to use key feature problems (KFPs).[72] Questions prompt learners with a clinical scenario followed by a series of open-ended questions on essential steps for resolution of the case. This testing format allows for several approaches to the same scenario and prompts learners to think about problem identification, diagnostic strategies, and management decisions.[72, 73] KFPs are used as part of licensing and certification examinations in several countries, and scores have been shown to correlate with clinical performance outcomes.[68]

The script concordance test (SCT) was developed based on the illness script theory.[74] Questions include a short clinical scenario followed by diagnostic possibilities and sequential pieces of information to consider. After each new piece of information, the learner is prompted to indicate how it affects decision-making using a Likert-type scale. SCTs can also be structured to probe knowledge about use of diagnostic tests or therapeutic interventions. Unlike MCQs, where there is always one most correct answer, this method compares the responses of learners to the range of responses generated by a reference panel of experts. Scoring is based on the degree of concordance with the reference panel. Thus, there is greater capacity to reflect the authentic situation of a clinical problem that does not necessarily have a simply defined, single correct answer. SCTs are challenging to develop, although they can be administered with the same ease as MCQs.[75] Research concerning SCTs suggests that scores offer a valid reflection of diagnostic reasoning, with test performance correlating with clinical experience and in-training examination scores.[76, 77] However, by design, SCTs emphasize System 2 processes, specifically how clinicians interpret data with a particular hypothesis in mind.
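
As an illustration of concordance-based scoring, the sketch below implements the aggregate method commonly described for SCTs, in which a response earns credit equal to the number of panelists who chose it divided by the number who chose the modal response. The text above does not specify a scoring formula, so this rule, the function names, and the panel data are assumptions for illustration only.

```python
from collections import Counter


def sct_item_credits(panel_responses):
    """Credit per Likert response: panelists choosing it / panelists choosing the modal response."""
    counts = Counter(panel_responses)
    modal_count = max(counts.values())
    return {response: n / modal_count for response, n in counts.items()}


def sct_score(learner_responses, panel_by_item):
    """Learner total as a percentage of the maximum (one point per item)."""
    total = 0.0
    for item_id, panel_responses in panel_by_item.items():
        credits = sct_item_credits(panel_responses)
        # A response that no panelist chose (or a skipped item) earns no credit.
        total += credits.get(learner_responses.get(item_id), 0.0)
    return 100.0 * total / len(panel_by_item)


# Hypothetical data: a 10-member reference panel rating two items on a -2..+2 scale.
panel = {
    "item1": [1, 1, 1, 1, 1, 1, 2, 2, 2, 0],          # modal response is +1
    "item2": [-2, -2, -2, -2, -1, -1, -1, -1, 0, 0],  # two modal responses
}
learner = {"item1": 2, "item2": -1}

print(sct_score(learner, panel))
```

Under this rule a learner can earn partial credit for a minority expert opinion, which is how the format accommodates problems without a single correct answer.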

Oral Examinations

The American Board of Emergency Medicine uses an oral examination consisting of clinical scenarios to assess diagnostic reasoning. Clinical experts simultaneously provide sequential information about the scenario and assess the participant using a structured template of key actions. The expert assessor, an EP trained to give the examination, can explore diagnostic reasoning using semistructured prompts. Several studies suggest that oral exams can be valid assessment instruments,[78, 79] although these tools are resource-intensive when used for high-stakes assessment. Oral examinations that present multiple clinical scenarios simultaneously may offer a novel way to assess diagnostic reasoning that more closely approximates the cognitive load inherent in most ED settings.[7, 46, 48]

Objective Structured Clinical Examinations

Objective structured clinical examinations (OSCEs) use multiple brief stations, at each of which a specific, truncated task is performed in a simulated environment. Scoring involves either standardized patients (SPs) or experts completing checklists or global rating scales.[80] Using an OSCE to assess diagnostic reasoning may be confounded by the use of: 1) SPs as nonexpert scorers (although some evidence exists that nonclinicians and nonexperts can be used with sufficient reliability),[81, 82] 2) checklists that focus on thoroughness of data gathering and devalue System 1 reasoning,[83] and 3) a truncated scenario that does not simulate the diagnostic density or complexity of clinical problems encountered in EM practice. Although OSCEs represent a step toward authenticity relative to written or oral examinations,[84] the validity of this method for diagnostic reasoning assessment remains uncertain.[85, 86]

Virtual Patients

Virtual patients, such as the computer-based case simulations employed in Step 3 of the USMLE, prompt examinees to obtain a history, perform a physical examination, and make diagnostic and therapeutic decisions.[87, 88] Examinees direct the patient encounter and the advance of simulated time, independently generating diagnostic possibilities and determining what additional information is necessary to confirm or refute initial hypotheses.[87] This type of testing attempts to bridge the gap between the control afforded by standardized testing and the authenticity of a true patient interaction. Through the lens of diagnostic error, one study demonstrated that 22% of examinees made potentially harmful decisions on the USMLE Step 3 computer case simulations, although the authors emphasized that such actions have not been shown to be predictive of a physician's decisions in a true clinical setting.[89]

Team-based Simulation

Team-based simulation (or crisis resource management simulation) has recently emerged as an instrument with potential to assess diagnostic reasoning. This type of simulation uses computerized mannequins, physical replications of clinical care areas, and multiple actors (nurses, respiratory therapists, physician colleagues, etc.) to approximate the complexity of diagnostic reasoning (among other competencies) in the clinical environment.[90] Simulation offers the benefits of standardization and opportunities to explore reasoning in greater detail using postencounter debriefings. However, the relationship between simulated and actual clinical performance is unclear.[91] Most current research has focused on the instructional and educational value of this type of simulation, rather than the use of this modality for assessment.[91, 92] While the efficacy of partial task trainers (i.e., partially simulated models of procedural tasks) to assess performance of procedural skills has been demonstrated,[92] further research is required before team-based simulation can be recommended for high-stakes assessment of diagnostic reasoning.[93]

Assessing Diagnostic Reasoning in the Clinical Setting

Workplace-based assessments sample learner performance in the clinical environment to form a judgment of diagnostic reasoning capacity.[25, 94] In artificial testing environments, the significant multitasking demands of EM are removed, perhaps leading to an artificial inflation of a learner's diagnostic reasoning ability. For these and other reasons, multiple undergraduate, graduate, and continuing medical education organizations have endorsed these types of in vivo assessments.[2-5, 95, 96] The feasibility of these methods is challenged by the time and attention required of assessors, especially given the frequency of interruptions and distractions in EDs.

Direct Observation Tools

There are few tools explicitly designed to assess diagnostic reasoning via direct observation in the clinical setting. The existing direct observation instruments (e.g., in-training evaluation reports, encounter cards, mini clinical evaluation exercise [mini-CEX], and the standardized direct observation tool [SDOT]) generally use checklists and global ratings that are completed by a physician assessor observing a learner perform a focused element of patient care.[95-101] Narrative comments are typically required, but are often incomplete or limited in nature.[100, 101] While no instrument specifically addresses diagnostic reasoning, a number of items loosely approximate it. For example, the mini-CEX includes assessment in the domain of “clinical judgment,”[97] while the SDOT includes ratings of learner performance in “synthesis/differential diagnosis” and “management.”[98, 99] These tools have the advantage of offering observations in authentic, real-time settings; however, the protocol requires the observer to infer the line of reasoning based on the behavior observed.

Retrospective Clinical Case Analysis

Retrospective clinical case analysis, in contrast to direct observation tools, provides an opportunity for learners to reflect upon past clinical decisions with real patients in the presence of an examiner.[102] Chart stimulated recall uses semistructured interviews with expert assessors. Learners are probed regarding their decision-making on actual cases, providing insights that may not be documented in the medical record or fully observed in real time.[103] A chart audit involves nonexpert assessors matching key metrics of patient care to a retrospective sample of a learner's charts.[104]

Multisource Feedback

Multisource feedback combines multiple assessments from the sphere of influence of the learner (e.g., resident peers, nurses, other health professionals, patients).[105-111] However, reliable assessment with this technique requires a large number of assessors. A study of 1066 physicians in the United Kingdom[112] found that 34 patient questionnaires and 15 peer questionnaires would be required to achieve a reliability of 0.70.
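
The dependence of reliability on the number of assessors can be illustrated with the Spearman-Brown prophecy formula from classical test theory. The cited study may have used a different approach (such as generalizability analysis), so the sketch below is an illustration under that assumption, not a reproduction of its method; the function names are hypothetical.

```python
import math


def spearman_brown(single_rater_reliability, n_raters):
    """Reliability of the mean of n_raters ratings, given the reliability of one rating."""
    r = single_rater_reliability
    return n_raters * r / (1 + (n_raters - 1) * r)


def implied_single_rater_reliability(target, n_raters):
    """Invert Spearman-Brown: single-rating reliability implied by reaching target with n_raters."""
    return target / (n_raters - (n_raters - 1) * target)


def raters_needed(single_rater_reliability, target):
    """Smallest number of raters whose averaged ratings reach the target reliability."""
    r = single_rater_reliability
    n = target * (1 - r) / (r * (1 - target))
    # Small tolerance so floating-point noise does not push an exact integer up by one.
    return math.ceil(n - 1e-9)


# Back-calculate from the figures quoted above (34 patient and 15 peer
# questionnaires for a reliability of 0.70), assuming Spearman-Brown applies.
for label, n in (("patient", 34), ("peer", 15)):
    r1 = implied_single_rater_reliability(0.70, n)
    print(label, round(r1, 3), raters_needed(r1, 0.70))
```

The lower the reliability of a single questionnaire, the more questionnaires must be averaged to reach a given target, which is consistent with more patient than peer questionnaires being required.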

Future Instruments

If we are unable to open the “black box” of reasoning in clinical practice, perhaps what is most important is the accuracy of the ED diagnosis and not the process by which it was achieved. However, hospital discharge diagnoses, the current criterion standard against which ED diagnoses are measured, suffer from hindsight bias.[11] Actual patient outcomes may be the best measure of diagnostic reasoning. However, current measures of quality (i.e., core measures) are confounded by many elements outside of an EP's control and may not correlate with the accuracy of diagnostic reasoning. To address these challenges, future clinically based assessment instruments should consider ED-specific markers of patient care that correlate with accurate ED diagnoses.

Finally, the surreptitious introduction of SPs into EDs might standardize the assessment of common EM diagnoses while preserving the authenticity of the decision environment.[113] However, a number of operational considerations must be addressed before widespread adoption, not the least of which is the ethics of blocking (real) patient access to emergency care because of the presence of SPs.

Prioritized List of Research Questions

With a review of the diagnostic reasoning literature and a thematic analysis by education and diagnostic reasoning experts, a series of important research questions was developed prior to the consensus conference. These questions were validated and prioritized using an iterative consensus process that consisted of background didactics by content experts, focused group discussion, and individual multivoting. Table 1 represents the highest priority education research questions relating to diagnostic reasoning as indicated by education thought leaders, education researchers, and front-line educators. The authors advocate for research programs to address these issues and funding agencies to promote research streams that explore these concepts.

Table 1. Prioritized List of Research Questions in Each Domain of Diagnostic Reasoning Assessment, as Validated by Participants in the Consensus Conference
SCT = script concordance test.

Assessing diagnostic reasoning in the extraclinical setting
  1. What is the effect of distraction on diagnostic accuracy in simulated environments?
  2. What factors influence the predictive value of extraclinical assessments of diagnostic reasoning when comparing to performance in the clinical environment?
  3. What is the value in assessing a learner's diagnostic abilities at different points in training using a standardized bank of simulated cases?
  4. Can the SCT demonstrate the development of diagnostic reasoning in learners over time?
  5. What extraclinical tools adequately assess System 1 processes?

Assessing diagnostic reasoning in the clinical setting
  1. What patient-oriented outcomes or surrogate markers are reliable and valid indicators of accurate ED diagnostic reasoning?
  2. What is the effect of metacognition on diagnostic error in experienced EPs?
  3. What is the feasibility of assessing diagnostic reasoning in real time?
  4. What is the effect of cognitive load (e.g., treating multiple patients simultaneously) on diagnostic reasoning of simple and complex problems?
  5. What ED-specific factors inhibit the assessment of diagnostic reasoning?


Diagnostic reasoning is a complex process, with elements of both System 1 and System 2 thinking. These processes must be inferred from behavior because diagnostic reasoning is not a discrete, measurable quality, nor is it independent of context and content. No single strategy can be used to assess the accuracy of a clinician's diagnostic decisions. Rather, multiple strategies must be used if an accurate assessment is to be gained. Many questions remain regarding how these reasoning processes can be most accurately measured, offering a multitude of avenues for future research with great potential to ultimately improve patient care.