Differential diagnosis checklists reduce diagnostic error differentially: A randomised experiment

Abstract

Introduction: Wrong and missed diagnoses contribute substantially to medical error. Can a prompt to generate alternative diagnoses (prompt) or a differential diagnosis checklist (DDXC) increase diagnostic accuracy? How do these interventions affect the diagnostic process and self-monitoring?

Methods: Advanced medical students (N = 90) were randomly assigned to one of four conditions to complete six computer-based patient cases: group 1 (prompt) was instructed to write down all diagnoses they considered while acquiring diagnostic test results and to finally rank them. Groups 2 and 3 received the same instruction plus a list of 17 differential diagnoses for the chief complaint of the patient. For half of the cases, the DDXC contained the correct diagnosis (DDXC+); for the other half, it did not (DDXC−; counterbalanced). Group 4 (control) was only instructed to indicate their final diagnosis. Mixed-effects models were used to analyse results.

Results: Students using a DDXC that contained the correct diagnosis had better diagnostic accuracy, mean (standard deviation), 0.75 (0.44), than controls without a checklist, 0.49 (0.50), P < 0.001, whereas those using a DDXC that did not contain the correct diagnosis did slightly worse, 0.43 (0.50), P = 0.602. Neither the number nor the relevance of diagnostic tests acquired was affected by condition, nor was self-monitoring. However, participants spent more time on a case in the DDXC− condition, 4:20 min (2:36), and the DDXC+ condition, 3:52 min (2:09), than in the control condition, 2:59 min (1:44), both P ≤ 0.001.

Discussion: Being provided with a list of possible diagnoses improves diagnostic accuracy compared with a prompt to create a differential diagnosis list, provided the list contains the correct diagnosis. A diagnosis list without the correct diagnosis did not improve, and may even have slightly reduced, diagnostic accuracy. The interventions affected neither information gathering nor self-monitoring.


| INTRODUCTION
Teaching students to diagnose a patient's condition is a key objective of medical education. 1 Yet despite the skill's prominent role in most medical curricula, diagnostic errors constitute a major source of medical error, which may lead to patient harm and even death. 2,3 Diagnostic errors, especially missed diagnoses, frequently result from a failure to consider the correct diagnosis or from settling on a wrong diagnosis too early. 4,5 Hence, considering alternative diagnostic hypotheses may improve diagnosis, especially early in the diagnostic process. [6][7][8] This is because the set of diagnostic hypotheses considered determines what diagnostic information physicians acquire (and omit) and how they interpret and integrate it. 9,10 For example, it has been shown that generating multiple hypotheses early on results in asking the patient more questions during a general practitioner (GP) consultation 11 and leads to more complete and less biased documentation among GPs. 12 Similarly, reflective practice, that is, a critical consideration of one's reasoning and decisions, may ultimately lead to fewer diagnostic errors, at least when facing difficult cases. 13 In contrast, diagnoses that are not considered, or considered only late in the process, are less likely to be detected, even with incoming supporting information. 10 These insights underlie efforts to reduce diagnostic errors by stimulating on-site hypothesis generation among diagnosticians, for example, with the help of diagnostic checklists. [14][15][16] Checklists may reduce physicians' cognitive load by relieving them of recalling all steps of a workup or all important differential diagnoses. This may be especially helpful during residency training, when junior physicians need their cognitive resources to train skills not covered in undergraduate education, such as management reasoning. 17 Although checklists are widely used in other high-risk settings such as operating rooms, [18][19][20] and despite their face validity, empirical evidence on the effectiveness of diagnostic checklists is scarce and contradictory. [21][22][23][24][25][26][27] Moreover, little is known about how they function and thus about how best to design them. We addressed this gap with an experimental study in an emergency room setting.
Different types of checklists have been proposed for use during the diagnostic process. 14 General checklists are symptom independent and provide a list of relevant steps to reach a diagnosis, such as considering alternative diagnoses or pausing to reflect. 14,16,28 Such checklists are intended to help reduce diagnostic errors, as they may enforce the completion of critical steps during diagnosis or trigger deliberate thinking to circumvent cognitive biases. 29 Although general checklists vary in the number and type of steps they list, they commonly include the instruction to consider alternative diagnoses and their respective likelihoods.
Yet despite their intuitive appeal as economical and situation-independent tools, research has not yet demonstrated substantial error reduction from their use, 21,30,31 similar to other methods of general debiasing. 32,33 Here, we test whether a generic prompt to consider alternative hypotheses and rank them results in more accurate diagnoses.
Symptom- or disease-specific checklists such as differential diagnosis checklists (DDXCs), in contrast, propose comprehensive lists of causes for specific presenting complaints. 14,34 A DDXC may act as a retrieval aid or reminder of frequent or commonly missed diagnoses and trigger the consideration of more alternative diagnoses and hence further data gathering. 34 Indeed, DDXCs have been shown to increase diagnostic accuracy in difficult cases. 21 Yet tests of the robustness of this effect are lacking. Also, given that it is hardly feasible to list all possible differential diagnoses in practice, (uncommon) diagnoses missing from a checklist may be considered even less. This hypothesis rests on findings from related research on availability bias, showing that the presentation of a diagnosis may (falsely) prime diagnosticians in their subsequent diagnosis. 35,36 Also, there is evidence that, whereas computer aids providing correct advice may improve internists' ECG interpretation, they may lower accuracy when providing an incorrect interpretation. 37 It is therefore of great practical importance to understand whether the inclusion of the correct diagnosis on a DDXC is crucial to its beneficial effect and whether its absence may even decrease the likelihood of making the correct diagnosis.
Moreover, it is not yet well understood how the utilisation of prompts or checklists impacts the diagnostic process. 38 In fact, it is conceivable that enlarging the hypothesis space may lead to 'excessive consultation or needless testing' (p. 311) 14 rather than merely remedy undertesting. 39 Furthermore, influencing the way diagnosticians approach a problem may affect their meta-cognitive monitoring, because the actual task changes from having to generate a diagnosis to merely having to select one (at least when it is on the DDXC). Previous research on create-versus-select response tasks in assessments suggests that they differ in important characteristics of meta-cognition. 40 Here, we examine how the specific interventions discussed above affect (a) diagnostic accuracy in common cases presenting to the emergency department, (b) the diagnostic process (i.e., the set of hypotheses and the extent and quality of testing) and (c) confidence in the diagnosis. We also examine the moderating effect of case difficulty, as an earlier study indicated that a DDXC may improve diagnostic performance only in more difficult cases. 21

| METHOD

| Participants
All medical students in their fourth (out of six) academic year at the Charité – Universitätsmedizin Berlin (N = 300) were eligible to participate in the present study. The required sample size was estimated using G*Power for an a priori power analysis. 41 Assuming a large effect size of d = 0.8 for diagnostic accuracy, 42 the total required sample size for an analysis of variance (ANOVA) with four groups (α = 0.05, power = 0.80) was 76.
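The reported sample-size estimate can be reproduced in R; below is a minimal sketch using the pwr package (the authors used G*Power), assuming Cohen's f = 0.4 as the one-way-ANOVA analogue of d = 0.8 (f = d/2 for two groups; f = 0.4 is conventionally 'large'):

```r
# A priori power analysis for a four-group one-way ANOVA
# (alpha = 0.05, power = 0.80, large effect size f = 0.4).
library(pwr)

res <- pwr.anova.test(k = 4, f = 0.4, sig.level = 0.05, power = 0.80)
res$n               # ~18.04 participants required per group
ceiling(res$n) * 4  # 76 participants in total, matching the reported estimate
```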

| Materials
Advanced medical students took part in the norm-referenced computer-based Assessing CLInical REasoning (ASCLIRE) test, 43 in which the task is to diagnose six simulated patients with dyspnea, a common symptom in the emergency department. 44 At the beginning of each case, the patient, portrayed by the same male standardised actor with case-specific prototypical symptoms and makeup, was presented in a 30-s video clip showing shortness of breath. 42,43,45 Per case, participants were then presented with 30 diagnostic tests on their computer screen, from which they were free to choose the type, order and number of tests to administer; repeated acquisition was allowed. When clicking on a test, participants were presented with the test result in the form of text (e.g., pulse rate), an image (e.g., ECG and chest X-ray) or an audio clip (e.g., heart sounds and history), which they had to interpret. Participants could finish data gathering at any time to provide a final diagnosis and their related confidence on the next screen. They then proceeded to the next case.
There were four study conditions (Figure 1): in the prompt condition, participants were instructed to write down all diagnostic hypotheses they considered, from the moment they knew the chief complaint and throughout the diagnostic process, and to rank the generated diagnoses before committing to a final diagnosis. In the two DDXC conditions, participants received, in addition to the 'prompt' instruction, a list of 17 (alphabetically ordered) diagnoses at the beginning of the experiment. Two versions of the list were designed in a counterbalanced fashion: version 1 contained the correct diagnoses for cases 1 to 3 but not for cases 4 to 6, and vice versa for version 2. Participants in the DDXC conditions were randomly assigned to receive version 1 or 2. We treated the three cases for which the checklist contained the correct diagnosis and the three cases for which it did not as the DDXC with correct diagnosis provided condition (DDXC+) and the DDXC without correct diagnosis provided condition (DDXC−), respectively. In other words, whether the DDXC contained the correct diagnosis varied within subject. Participants were explicitly informed that the DDXC would not necessarily contain the correct diagnosis. They were instructed to list all generated diagnoses, including diagnoses from the DDXC as well as other diagnoses they thought of. In the control condition, participants were only asked to type in their final diagnosis at the end of their information search.

| Ethics
Participation was voluntary, and the Institutional Review Board of the Charité – Universitätsmedizin Berlin granted study approval (EA4/096/16).

| Case difficulty
Case difficulty was determined as the mean accuracy per case, computed across all students and conditions. 43,45
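As a minimal sketch in R (assuming a hypothetical long-format data frame d with one row per participant-case and a 0/1 column correct), this is a simple per-case mean:

```r
# Case difficulty = mean diagnostic accuracy per case,
# computed across all students and conditions.
case_difficulty <- aggregate(correct ~ case_id, data = d, FUN = mean)
```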

| Characteristics of generated lists
We recorded the number of diagnostic hypotheses considered and whether the correct diagnosis appeared on the self-generated lists in the intervention conditions and, if so, at which rank position.

| Diagnostic accuracy
Diagnostic accuracy of all final diagnoses was rated by three experts as either incorrect (0) or correct (1). 43 Experts were board-certified internists and emergency physicians blinded to the study condition.
Inter-rater reliability was almost perfect (Fleiss kappa = 0.841, P < 0.001). 46 We used a majority rule to aggregate their ratings. In a previous study, mean accuracy ranged between 0.36 and 0.78, depending on the case. 43
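Both steps are straightforward to compute; below is a minimal sketch in R using the irr package, assuming a hypothetical matrix ratings with one row per diagnosis and one 0/1 column per expert rater:

```r
# Agreement among the three expert raters and majority-rule aggregation.
library(irr)

kappam.fleiss(ratings)  # Fleiss' kappa across the three raters

# Majority rule: a diagnosis counts as correct (1) if at least
# two of the three raters scored it as correct.
accuracy <- as.integer(rowSums(ratings) >= 2)
```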

| Information acquisition measures
Similar to previous studies, 42 we recorded three indicators of data-gathering behaviour per case: the number of diagnostic tests acquired, the relevance of the acquired tests and the time spent on the case.

| Confidence
Participants recorded their confidence in the correctness of the diagnosis on a 10-point ordinal scale ranging from 10 (least confidence) to 100 (highest confidence). 47,48 From this, we calculated a self-monitoring index per person across cases to capture the extent to which confidence varied as a function of the likelihood of being correct. [49][50][51] In detail, we first calculated, per person, the mean confidence for all of the person's incorrect answers and the mean confidence for all of the person's correct answers. We then subtracted the mean confidence for incorrect answers from the mean confidence for correct answers. This difference constitutes the self-monitoring index. 47,48 Self-monitoring indices may thus range from +90 (perfect self-monitoring, if a participant indicated the highest confidence of M = 100 for all correct answers and the lowest confidence of M = 10 for all incorrect answers) to −90.
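A minimal sketch of this computation in R, assuming a hypothetical long-format data frame d with columns subject_id, correct (0/1) and confidence (10 to 100 in steps of 10):

```r
# Self-monitoring index per person: mean confidence on correct cases
# minus mean confidence on incorrect cases (+90 = perfect, -90 = inverted).
library(dplyr)

self_monitoring <- d %>%
  group_by(subject_id) %>%
  summarise(index = mean(confidence[correct == 1]) -
                    mean(confidence[correct == 0]))
# The index is undefined (NaN) for participants who solved
# all cases correctly or all cases incorrectly.
```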

| Statistical analyses
We used linear mixed-effects models to analyse the impact of our interventions on diagnostic accuracy and their effects on the diagnostic process and confidence. We used the procedures provided in the lme4 package for fitting the linear mixed models 52 in R. 53 To address the research questions, we fitted nine separate models: first, three models with list characteristics as the dependent variables (list length, position of the correct diagnosis on participants' lists and accuracy of listed diagnoses); second, one model with diagnostic accuracy as the dependent variable; third, three models with indicators of data-gathering behaviour as dependent variables (number of tests, relevance of tests and time on case); and, finally, two models with participants' confidence and self-monitoring indices as dependent variables.
Across all models, we included subject ID as random intercept.
Furthermore, we included condition (prompt, DDXC+, DDXC− and control) and case difficulty (i.e., mean accuracy per case across participants) as fixed effects in all models. Where meaningful, we included the accuracy of the final diagnosis per participant as a fixed effect, too.
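As an illustration of this model structure, here is a minimal sketch with lme4, assuming a hypothetical long-format data frame d with columns correct (0/1), condition (factor: control, prompt, DDXC+, DDXC−), difficulty (mean accuracy per case) and subject_id; time_on_case stands in for the other process outcomes:

```r
library(lme4)

# Accuracy model: condition and case difficulty as fixed effects,
# subject as random intercept (a linear probability model, as in the
# text; p-values can be obtained via, e.g., the lmerTest package).
m_accuracy <- lmer(correct ~ condition + difficulty + (1 | subject_id),
                   data = d)

# Process model (e.g., time on case): final accuracy enters
# as an additional fixed effect where meaningful.
m_time <- lmer(time_on_case ~ condition + difficulty + correct +
                 (1 | subject_id), data = d)
```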
We used conventional thresholds for levels of statistical significance (P < 0.05; P < 0.01; P < 0.001) and did not adjust these thresholds for multiple comparisons because we did not consider spurious statistically significant effects to be an issue in the current application.

| RESULTS

| Participants
Ninety students in their fourth academic year participated in the study. Random assignment to condition was effective; participant groups were similar with regard to the proportion of females, mean age, clinical semesters and task-specific medical knowledge (Table 1).

| Outcome measures
In the following, we report, for each outcome variable, first the descriptive results and then the results from the mixed-effects models. Please refer to Table 2 for descriptive statistics and estimated differences between conditions; complete results of the mixed-effects models are reported in the supporting information (Table S1).
Next, we determined the accuracy of the listed diagnoses, that is, whether participants' own lists contained the correct diagnosis or not. We found that, compared to the prompt condition (M = 0.69), the accuracy of the listed diagnoses varied with condition (see Table 2 for descriptive statistics).

Results from the mixed models revealed no statistically significant differences in the number of tests between conditions or between cases of different difficulty levels, but a difference between correctly and incorrectly solved cases, P = 0.040 (Table S3). They also revealed an effect of condition on the time spent on a case, P < 0.001, as well as of final accuracy, P = 0.004, but not of case difficulty.

[Table 2 notes: In all but the self-monitoring analyses, the number of cases analysed was n control = 180, n prompt = 186, n DDXC− = 87 and n DDXC+ = 87. In the self-monitoring analyses, the number of participants analysed was n control = 29, n prompt = 28, n DDXC− = 22 and n DDXC+ = 17. Results in bold indicate P < 0.05. (a) Differences between the mean of the condition and the baseline were estimated from mixed-effects models that control for case difficulty and, where appropriate, for accuracy of the final diagnosis; for a complete report of the results of the mixed-effects models, see the supporting information. (b) The control group did not generate a differential diagnosis list.]
Relevance of acquired tests was similarly high across conditions (Table S4). Differences in confidence between cases of different difficulty were not statistically significant, but confidence did differ between correctly and incorrectly solved cases, P < 0.001. There were no statistically significant differences in self-monitoring between conditions.

| DISCUSSION
Diagnostic checklists have been proposed as a handy means to reduce diagnostic errors, especially missed diagnoses. 14,28 In this experimental study, we investigated how a prompt to generate and write down alternative diagnoses (prompt) and, additionally, the provision of a differential diagnosis checklist (DDXC) affect diagnostic accuracy as well as information gathering and confidence. Results indicate that using a generic prompt does not improve diagnostic accuracy, which concurs with previous research. 21,26,30,54 Results further indicate that providing a DDXC increased diagnostic accuracy only when it included the correct diagnosis, which is in line with one previous study. 21 In fact, unlike in this previous study, the benefit was not restricted to difficult cases. However, when the DDXC did not contain the correct diagnosis, accuracy tended to be lower than in the control group, which received neither the instruction to list hypotheses nor a differential diagnosis checklist, although this difference was not statistically significant. One possible explanation for the lack of benefit of the generic prompt is that merely listing alternatives may not induce the kind of critical reflection that has been shown to reduce errors in difficult cases. 13 Also, previous research suggests that only early consideration of alternative diagnoses improves accuracy. 7 Thus, future studies should investigate whether variations of our prompt that specify the timing (e.g., as early as possible) or number of alternative diagnoses and/or include multiple steps yield better results.
Together, our findings corroborate research suggesting that providing content-specific knowledge (here, in the form of symptom-related checklists) rather than general debiasing procedures 33 may increase the performance of (junior) physicians. 27,32,58 At the same time, the finding that decision aids such as our DDXCs may be potentially harmful when incomplete calls into question their overall benefit, given that it is impossible to list all relevant diagnoses in practice. Even more sophisticated ways of presenting diagnostic decision support, such as computer aids, bear the potential to be incomplete or even incorrect. Therefore, more systematic research into the effect of missing and incorrect advice on diagnostic accuracy is needed. 37,59

| Limitations
Our findings of limited benefits of checklists may call into question the suitability of checklists for the diagnostic process as a whole.
However, it needs to be acknowledged that we tested only three types of interventions here. Also, our experimental setup limits the generalisability of our results to clinical practice, to different presenting complaints, to more experienced physicians and to other (less ill-defined) settings. Generalisability to more atypical case presentations is also limited, as our cases were designed as rather unambiguous, prototypical presentations, which are common in the emergency department. Previous studies have argued that common cases tend not to benefit from (checklist-induced or instructed) reflection. 13,21 Thus, although some cases were difficult (i.e., solved by few), they might have been difficult not because they were uncommon or ambiguous but because of knowledge deficits at this stage of medical education. Comparability with previous vignette studies is also limited because we did not use vignettes presenting all material at once 21 but rather cases that allowed self-directed navigation through the diagnostic process, which has the benefit of better resembling real-life decision making.

| Conclusion
To conclude, we provide evidence of the potential benefit of using a symptom-specific differential diagnosis checklist during the diagnostic process: it can improve diagnostic accuracy without altering data-gathering behaviour. At the same time, our finding that the correct diagnosis needs to be on the checklist limits the potential benefit of this type of intervention and suggests that how to design the most effective checklists, and how to integrate them into the diagnostic process, remain important open questions.