We examined whether a ‘hint’ manoeuvre increases the time novice medical learners spend on reviewing a radiograph, thereby potentially increasing their interpretation accuracy.
Senior year medical students were recruited into a randomised control, three-arm, multicentre trial. Students reviewed an online 50-case learning set that varied in degree of ‘hint’ intervention. The ‘hint’ was a dialogue box that appeared after a student submitted an answer, encouraging the student to re-evaluate their interpretation. The students in the control group received no hints. In the weak intervention group, students received ‘hints’ with 66% of their incorrect interpretations and 33% of those that were correct. In the strong intervention group, the incorrect interpretation hint frequency was 80%, whereas for correct responses it was 20%. All students completed a 20-case post-test immediately and 2 weeks after the 50 cases. The primary outcome was student performance on the immediate post-test, measured as the ability to discriminate between normal and abnormal films (dPrime). Secondary outcomes included the probability of considering the hint, time spent on learning cases and knowledge retention at 2 weeks.
We enrolled 117 medical students from three sites into the three study groups: control (36), weak intervention (40) and strong intervention (41) groups. The mean (standard deviation) dPrime in the control, weak and strong groups were 0.4 (1.1), 0.7 (1.1) and 0.4 (0.9), respectively (P = 0.4). In the weak and strong groups, participants reconsidered answers in 556 of 1944 (28.6%) hinting opportunities, and those who reconsidered their answers spent a mean (95% confidence interval) of 13.9 (11.9, 16.0) seconds longer on each case. There were no significant differences in knowledge retention at 2 weeks between the groups (P = 0.2).
Although the implemented hinting strategy did result in students spending more time considering a proportion of the cases, overall it was not effective in improving student performance.
Radiographs are one of the most commonly ordered diagnostic tests in medicine. Medical educators strive to ensure that learners attain competence in the skill of radiograph interpretation. Online education enables learners to review large item banks of digitised radiographs and it is becoming increasingly popular as a supplement to learning that occurs during clinical rotations.[1-3] However, the most effective instructional strategies for online learning require further investigation.
In our previous work, we examined the effectiveness of web-based learning of radiograph interpretation using deliberate practice as the dominant instructional strategy.[5, 6] We studied the performance of 20 medical students, 18 paediatric residents, five paediatric emergency staff and three radiologists who were asked to classify 234 cases of ankle radiographs as ‘normal’ or ‘abnormal.’ It was surprising to find that medical students spent almost 50% less time on each radiograph case than radiologists (mean 26.2 versus 52.1 seconds) despite their significantly lower skill and performance.[7, 8] One possible explanation for the medical students' shorter examination time is that they abandon their visual searches for pathology prematurely, a known cause of interpretation error.[7, 8] In this research, we considered whether an intervention to promote longer searches for pathology can improve the learning of radiograph interpretation by novices.
We developed a novel instructional strategy in the form of a ‘hint’ as a means of mitigating the tendency of students to abandon the search for pathology prematurely. Four distinct literatures informed our conceptual model of the hinting strategy: (i) psychological theories of radiograph interpretation; (ii) theories of search satisficing; (iii) taxonomies of hints or suggestive feedback; and (iv) multimedia instructional design.
The cognition of radiograph interpretation has been studied by a number of investigators.[8, 9] Kundel et al.[8, 9] presented a model of radiology cognition based on eye-tracking data and described abnormality detection occurring in four steps: orientation to the radiograph, scanning of the radiograph, visual pattern recognition and decision making. Of the four elements, the time measurement in our previous work suggested that a relatively short scanning time may have contributed to medical student error in identifying pathology.
In searching a given space, humans typically use one of two strategies for knowing when to call off the search: ‘search satisficing’ defined as calling off a search once it is ‘good enough’,[10-12] or ‘search maximising’, where the search is called off only once all of the available information is considered. The search satisficing heuristic can lead to error in the form of premature closure. In those who engage in search satisficing to the point of error, it would be beneficial to introduce external feedback such that it shifts the learner to work harder to consider more of the available data to optimise their decision making, resulting in a search ‘maximising’ strategy.
Hinting can provide a source of external feedback and is a well-known instructional strategy in which an instructor either provides additional information or highlights known information. Hints can take many forms, ranging from the raised eyebrow of the physically present tutor to the direct provision of probabilistic information from an artificial intelligence-based computer tutoring system, and have been studied in domains where problem-solving scripts can be elaborated in detail, as in mathematics instruction.[15-17] Hume et al. studied tutors interacting with medical students learning about cardiovascular physiology and classified hints into three categories. Some hints conveyed information that the student did not otherwise have, enabling them to proceed with the problem where they might otherwise not have been able to do so. In cases where the students possessed the information but needed guidance, a hint pointed to the significance of information to allow the learner to solve the problem. Finally, a hint could take the form of a directed line of reasoning, where a tutor outlines the sequence of steps required to solve the problem through a series of linked questions. Tsovaltzi et al. elaborated Hume et al.'s taxonomy of hints by noting that hints can be adapted to a situation by moderating the method of elicitation of the information to be given away. In passive elicitation, the student is simply given partial information that brings them closer to a solution. In active elicitation, the tutor guides the student indirectly to self-generate the missing information. Tsovaltzi et al. also identified a problem referential dimension to hinting. That is, hints were generally adapted to the context of the tutorial session depending on the learner's responses or the nature of the specific problem being considered. Table 1 lists examples of these principles.
|Hint type or dimension||Example|
|Domain knowledge hint – conveys key declarative knowledge from the domain that is required to solve problems.||Growth plates are the weakest part of the paediatric musculoskeletal system.|
|Inferential hint – addresses whether the key declarative knowledge has been recognised.||In examining the radiograph, did the subject pay particular attention to the growth plates?|
|Elicitation status of hint – the degree to which the student is able to self-generate the missing information or inference.||Passive elicitation requires little effort on the student's part (e.g. ‘Did you pay particular attention to the growth plates?’). Active elicitation requires the student to generate the inference (e.g. ‘To which part of the bone should you pay particular attention?’).|
|Directed line of reasoning (scaffolding) – hints that move the student along a series of steps to arrive at the answer.||‘First you should examine the surface of the talus in the AP view’|
|Problem referential dimension – these hints take into account the tutoring session and this particular learner.||‘You have submitted an answer without examining all three views’; ‘Is this case similar to the previous one where the mechanism of injury was the same?’|
|Hint specificity – hints can range from being so general as to be of very little help to so specific as to give away the answer.||‘You may possibly be incorrect.’; ‘You are probably incorrect.’; ‘You are incorrect.’; ‘You are incorrect – examine the lateral view.’; ‘You are incorrect – examine the distal fibula on the lateral view.’|
To our knowledge, these hinting taxonomies have not been empirically validated such that optimal hinting principles can be cited in the design of our current work. However, we incorporated two of the principles described by Hume et al. and Tsovaltzi et al. to create a hinting strategy to enhance learning from our web-based radiograph interpretation learning system. Specifically, we used the hinting principles of active elicitation and pointing to significant information. In addition, we incorporated the multimedia design principle of reflection to help medical learners increase their cognitive skill when reviewing a radiograph by questioning whether their approach to interpreting the radiograph was optimal. Thus, in a predefined proportion of cases, once a student had classified a radiograph as ‘normal’ or ‘abnormal,’ a hint dialogue box would appear, informing them that the appearance of the hint probably meant that they had made an error and encouraging them to reconsider their answer (Fig. 1). Our research hypothesis was that this hint strategy would mitigate the student tendency to abandon their search for pathology early and result in learning over and above that achieved with explicit feedback alone.
This was a prospective, randomised control, three-arm multicentre trial whereby the experimental groups received a ‘hint intervention’ during the deliberate practice of reviewing serial ankle radiographs.
In previous research, we assembled 234 ankle radiographs taken for the purposes of excluding a fracture. Paediatric ankle radiographs were selected because they are a commonly ordered radiograph among frontline providers and clinical decision making is dichotomous. From the institutional Picture Archiving and Communication System, we downloaded the three standard ankle radiograph views (anteroposterior, lateral, mortise) from each case and saved each image in JPEG format with the respective final staff paediatric radiology report. For each case, we abstracted a brief clinical history and categorised each case as either normal or abnormal based on the official radiology report.
Before the radiographic cases were reviewed, there was an introductory interactive tutorial on paediatric ankle radiograph interpretation. For the presentation of the cases, html, php and flash 9 professional (Adobe Systems, Inc., San Jose, CA, USA) were used to develop a computer program that would allow the practice of radiograph interpretation with immediate feedback after each case. The participants started with reviewing the case history and were then asked to submit their interpretation as either ‘probably normal’, ‘definitely normal’, ‘probably abnormal’ or ‘definitely abnormal’. If the answer was categorised as abnormal, the participant also indicated the location of the suspected fracture. Although there were four possible response options, for the purposes of this study, the responses that included ‘normal’ were merged into one category and the responses that included ‘abnormal’ were merged to create the second possible response option. Once the participant was committed to his or her diagnosis, the program provided immediate feedback by highlighting pathology on the abnormal images and providing the radiologist's report. All responses were written to a mysql database.
From the 234 cases, we selected a 50-case learning subset based on the specific diagnosis and the radiograph's Rasch item difficulty index, which had been determined in previous research. To select the 50 learning cases, we designed a computer algorithm using labview8.6 software (National Instruments, Austin, TX, USA). Monte Carlo simulations were carried out to repeat a computer algorithm (500 000 iterations) that extracted a set of 50 cases with the following characteristics: (i) an abnormal to normal ratio of 50% to optimise learning from cases; (ii) a rate of specific pathology consistent with the overall item bank; (iii) values of Rasch item difficulty to ensure an overall difficulty of the item set appropriate to medical students; and (iv) an internal consistency Cronbach's alpha of >0.8. The aim of using this method of case selection was to generate case sets that were reliable and valid for the target audience.
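The general shape of such a Monte Carlo selection can be illustrated with a short Python sketch. This is not the LabVIEW algorithm used in the study: the dictionary keys `abnormal` and `difficulty`, the `target_difficulty` parameter and the scoring rule are assumptions made for illustration, and only two of the four criteria (the 50% abnormal ratio and the overall difficulty) are checked.

```python
import random


def select_case_set(bank, target_difficulty, set_size=50,
                    iterations=10_000, seed=0):
    """Draw random candidate sets and keep the one closest to the target.

    `bank` is a list of dicts with the hypothetical keys 'abnormal' (bool)
    and 'difficulty' (Rasch item difficulty). Returns None if no candidate
    satisfied the 50% abnormal constraint.
    """
    rng = random.Random(seed)
    best, best_gap = None, float("inf")
    for _ in range(iterations):
        candidate = rng.sample(bank, set_size)
        # Criterion (i): exactly half of the cases must be abnormal.
        if 2 * sum(c["abnormal"] for c in candidate) != set_size:
            continue
        # Criterion (iii), simplified: mean difficulty near a chosen target.
        mean_difficulty = sum(c["difficulty"] for c in candidate) / set_size
        gap = abs(mean_difficulty - target_difficulty)
        if gap < best_gap:
            best, best_gap = candidate, gap
    return best
```

A fuller version would also reject candidates whose mix of diagnoses departs from the item bank or whose Cronbach's alpha, estimated from earlier response data, is 0.8 or lower.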
The post-tests consisted of 20 cases that were not in the 50-case learning set. Monte Carlo simulations on the remaining original 184 cases were used to repeat a computer algorithm (500 000 iterations) to extract 20 cases that resulted in an internal consistency of at least 0.8, a 50% abnormal to normal ratio and included examples from each of the diagnostic categories. This post-test was given initially after the first 50-case learning set (post-test 1) and then again 2 weeks later (post-test 2) to evaluate learning retention. The 20 post-test cases were presented in a fixed order common to all participants and no feedback was provided during the review of these cases.
We designed the hinting intervention such that it would encourage students to reconsider the case (i.e. prolong their search) by having a dialogue box pop up and suggest that a participant's submitted answer may not be correct (Fig. 1). The aim of this ‘hint’ was that it would teach the students to be more persistent in searching for abnormalities. Having the dialogue box appear with every case where the participant was incorrect would have been equivalent to direct feedback. Instead, we sought to keep the conditions of uncertainty by leaving some possibility that the participant was in fact correct. We selected ratios that would provide an incentive to go back and reconsider an answer (hints were elicited more often with incorrect responses). Because we could not be certain of what level of uncertainty would be optimal, we created two groups with different probabilities that the student was in fact correct to uncover any dose–response relationships. Thus, the frequency with which hints appeared varied with the intervention intensity. In the weak hint intervention group, a hint dialogue box was presented 66% of the time when the answer was incorrect and 33% of the time when the radiograph interpretation was correct. In the strong hint intervention group a hint dialogue box was presented 80% of the time when the participant answered incorrectly and 20% of the time when the radiographs were interpreted correctly. The ‘hints’ were only displayed during the 50-case learning set and not during the post-tests, which were common to all groups. The control group had no hinting intervention. There was no time limit imposed while the student reconsidered the diagnosis.
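The hint schedule can be expressed compactly in code. The Python sketch below is illustrative only and is not the study software: the function names are hypothetical, and the second function applies Bayes' rule to show how informative a hint is for a student with an assumed error rate.

```python
import random

# Probability of showing a hint, by study group and answer correctness.
HINT_PROBABILITY = {
    "control": {"incorrect": 0.0, "correct": 0.0},
    "weak": {"incorrect": 0.66, "correct": 0.33},
    "strong": {"incorrect": 0.80, "correct": 0.20},
}


def show_hint(group, answer_correct, rng=random):
    """Decide at random whether the hint dialogue box appears for this answer."""
    key = "correct" if answer_correct else "incorrect"
    return rng.random() < HINT_PROBABILITY[group][key]


def prob_incorrect_given_hint(group, error_rate):
    """P(answer is incorrect | hint shown) for an assumed error rate."""
    p = HINT_PROBABILITY[group]
    hinted_wrong = p["incorrect"] * error_rate
    hinted_right = p["correct"] * (1 - error_rate)
    total = hinted_wrong + hinted_right
    # None when hints never appear (control group).
    return hinted_wrong / total if total else None
```

For a student who is wrong on half of the cases (an assumed figure), a hint signals an error with probability of about 0.67 in the weak group and 0.80 in the strong group, so some uncertainty remains in both arms.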
We approached senior year medical students from three universities (sites 1–3) via e-mail. Students from multiple sites were selected to enhance the generalisability of the results. Potential participants were required to have access to the internet and to hold a personal e-mail address. The research ethics boards at all the participating institutions approved this study.
The students at each site were approached in accordance with regulations set at each university. The students from sites 1 and 3 were recruited via electronic solicitation on the listserv specific to senior level medical students at that site. Participants from site 2 were approached while on a radiology selective. The radiology selective was 4 weeks in duration, with between 20 and 30 students enrolled each month. These students were approached at a group teaching session in the first week of their rotation. Interested students from any site replied to the research coordinator via e-mail. The coordinator obtained consent for study participation and enrolled students were provided a link to the website and a unique username and password.
Upon logging on, students were randomly assigned to one of the three groups: control, weak intervention, strong intervention (Fig. 2). A randomisation program provided concealed allocation. Randomisation was performed using block sizes of six and participants were allocated to the three training sets using ratios of 2 : 2 : 2. To reduce the probability of selection bias, the blocks were stratified by institution. Students were blind to training set differences and therefore did not know their group assignment.
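Permuted-block randomisation of this kind can be sketched in a few lines of Python. The class and function names are hypothetical and the code is an illustration of the general technique, not the randomisation program used in the trial.

```python
import random

GROUPS = ("control", "weak", "strong")


def permuted_block(rng, per_group=2):
    """Return one shuffled block holding each group `per_group` times (2 : 2 : 2)."""
    block = [group for group in GROUPS for _ in range(per_group)]
    rng.shuffle(block)
    return block


class StratifiedBlockRandomiser:
    """Keeps a separate sequence of blocks of six for each institution."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self._pending = {}  # site -> assignments left in the current block

    def assign(self, site):
        queue = self._pending.setdefault(site, [])
        if not queue:
            # Current block exhausted: start a new permuted block for this site.
            queue.extend(permuted_block(self._rng))
        return queue.pop()
```

Because every block of six contains each group exactly twice, group sizes within a site can never differ by more than two at any point during enrolment.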
Students in all three groups first completed an introductory tutorial that consisted of 35 screens describing the anatomy and major fracture patterns for ankle radiographs in children. This was immediately followed by completion of the previously described 50-case learning set, which varied in degree of ‘hinting’ intervention (control, weak, strong) and ended with the 20-case post-test 1. Two weeks after the first post-test was completed, the participants were sent a reminder to log on to the system to complete the 20-case post-test 2. The 50 learning cases were presented in a random order unique to each participant, whereas the 20 post-test cases were presented in a fixed order common to all participants. Students who completed both phases of the research were provided with a $20 gift certificate and a certificate of completion.
The signal detection parameter of discrimination, dPrime, was the primary outcome. dPrime measures how well an observer can tell the difference between normal and abnormal events. In ankle radiograph interpretation, dPrime assesses the ability to discern radiographs with a fracture versus those without a fracture; in general, a novice diagnostician has difficulty discriminating between normal and abnormal films resulting in a small dPrime, whereas an expert can separate the images in terms of their degree of pathology and this results in a large dPrime. In similar research, this parameter has been shown to be a good measure of learner performance. Secondary outcome measures included the probability of using a hint strategy measured as those who reconsidered answers after hint presentations; and time (measured in seconds) spent reviewing a case before an answer was submitted. We also determined the reliability of the post-test cases measured as item-test correlations and Cronbach's alpha. Finally, we plotted the learning curves and determined Cohen's d effect size of the 50-case educational learning set.
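For readers unfamiliar with signal detection theory, dPrime is conventionally computed as the difference between the z-transformed hit rate and false alarm rate. The Python sketch below uses that standard definition. The log-linear correction for rates of 0 or 1 is an assumption added so that the function is always defined; the correction actually applied in this study is not stated here.

```python
from statistics import NormalDist


def d_prime(hits, misses, false_alarms, correct_rejections):
    """Discrimination index from counts of responses to abnormal and normal films.

    hits / misses: abnormal films called abnormal / normal.
    false_alarms / correct_rejections: normal films called abnormal / normal.
    """
    # Log-linear correction (assumed): keeps both rates strictly inside (0, 1).
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(fa_rate)
```

A reader who responds at chance on a 20-case test with 10 abnormal films (5 hits, 5 false alarms) obtains a dPrime of 0, and the value rises as abnormal and normal films are separated more reliably.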
From our previous work, medical students read ankle radiographs at a dPrime (standard deviation) of 0.65 (0.48), whereas for senior paediatric residents dPrime was higher by 0.35. Thus, using a minimally significant change of 0.35, alpha = 0.05, power (1 − beta) = 0.8 and standard deviation = 0.48, the estimated minimal sample size for a three-sample comparison of mean dPrime is 30 participants per group. In our experience, about 25% of participants who start similar interventions do not complete them. Therefore, we inflated our required sample size to a minimum of 40 per group (120 in total) to account for students who would not complete the protocol.
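The arithmetic can be checked with the usual normal-approximation formula for comparing two means, n = 2((z_alpha/2 + z_power) × SD / difference)². Using this two-group formula is an assumption, since the exact method behind the reported figure is not described, but with 80% power it gives the same 30 per group.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Participants per group to detect a mean difference `delta` (two-sided test)."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_power = z(power)          # about 0.84 for 80% power
    return ceil(2 * ((z_alpha + z_power) * sd / delta) ** 2)


def inflate_for_dropout(n, dropout_rate):
    """Enrolment needed so that `n` participants remain after dropout."""
    return ceil(n / (1 - dropout_rate))


n = n_per_group(delta=0.35, sd=0.48)
print(n, inflate_for_dropout(n, 0.25))  # 30 40
```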
All analyses were intention-to-treat analyses. Each completed radiograph was considered one item. Items were considered correct if the participant's response matched the original radiology report or incorrect if the submitted answer did not match the report. For abnormal items, to obtain a full score, the participant was required to match the radiology report and correctly indicate the specific region of abnormality on the image.
To determine the effectiveness of the hinting intervention, we specified a priori a random-intercept linear model in which the dependent variable was the signal detection parameter of discrimination (dPrime). Additional elements for the model were the study group (three levels) and the time of the post-test (two levels nested within each student) and their interaction. To test for differences by institution, we conducted one-way analyses of variance (anovas) on the main outcome variables and a two-way comparison to test for an interaction between the institution and the randomisation group. To determine the relationship among subgroups, pairwise comparisons were carried out using a Bonferroni adjustment. As a subgroup analysis, we also investigated whether students who used the hints as intended (per protocol) achieved significantly higher post-test scores. We calculated the mean number of hints accepted per individual and then created a dichotomous variable, at the participant level, which was positive if an individual accepted hints at a rate above the median and was negative if the rate was below the median. We then compared post-test scores in these ‘positive’ versus ‘negative’ groups. Finally, to determine the component measures of knowledge, we used the mean accuracy on the first 20 cases and the post-test results to calculate Cohen's d effect size.
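As a reminder of what the effect size represents, Cohen's d is the difference between two means divided by a standard deviation. The Python sketch below uses the pooled standard deviation of two sets of scores. The choice of the pooled rather than a paired variant is an assumption, as the variant used in this analysis is not specified.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(baseline, follow_up):
    """Standardised mean difference (follow_up minus baseline), pooled SD."""
    n1, n2 = len(baseline), len(follow_up)
    pooled_variance = (
        (n1 - 1) * variance(baseline) + (n2 - 1) * variance(follow_up)
    ) / (n1 + n2 - 2)
    return (mean(follow_up) - mean(baseline)) / sqrt(pooled_variance)
```

Here `baseline` would hold each participant's accuracy on the first 20 learning cases and `follow_up` the corresponding post-test accuracy.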
Enrolment occurred from April 2011 to February 2012. Of the 988 medical students invited to participate, 228 (23.1%) replied with an interest to participate. Of those who expressed an interest, 154 (67.5%) consented to participate and were randomised to one of three intervention groups. After randomisation, 21 (13.6%) dropped out before they had completed the 50 learning cases and post-test 1; a further two participants dropped out before post-test 2. In total, 117 participants completed the primary outcome (Fig. 2), with 36, 40 and 41 in the control, weak and strong groups, respectively.
During the 50-case learning set, the computer algorithm successfully delivered the hints in the ratios described above. Overall, in 556 of 1944 (28.6%, 95% confidence interval [CI] 26.6, 30.6) hinting opportunities, students clicked on the dialogue box to reconsider their answer and there was considerable inter-individual variability in acceptance of the hints with a median of 23% (interquartile range 11%, 38%) of the hints being accepted. The strong intervention group accepted hints 9% more often than the weak group (95% CI for difference 5%, 13%). The probability of reconsidering a response after a hint decreased with the number of cases completed in both hinting groups (r2 = −0.75). The intervention groups showed qualitatively superior scores on the 50 learning set cases compared with the control group (Fig. 3).
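The overall acceptance proportion and its interval can be reproduced with a simple Wald (normal-approximation) interval. This is an illustrative check only: the interval method used in the analysis is not stated, and this version ignores the clustering of hinting opportunities within students.

```python
from math import sqrt
from statistics import NormalDist


def wald_ci(successes, n, level=0.95):
    """Normal-approximation confidence interval for a proportion."""
    p = successes / n
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


low, high = wald_ci(556, 1944)
print(f"{100 * low:.1f}, {100 * high:.1f}")  # 26.6, 30.6
```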
The mean (standard deviation) times spent on the case in the control, weak and strong intervention groups were 44.7 (6.6), 48.6 (7.4) and 48.7 (7.0) seconds, respectively (P = 0.7) (Table 2). Students spent an average of 13.9 seconds (95% CI 11.9, 16.0) longer on each case where they accepted a hint. Students in the hinting groups did not spend any longer on cases in the post-tests than the students in the control group (P = 0.62). Over the 50 learning cases, the time spent on each case decreased with increasing case count (r2 = −0.76).
|Process measures||Control n = 36||Weak hint n = 39||Strong hint n = 40||P value|
|Median (range) time to complete learning case set, minutes||34.6 (9.4, 71.1)||36.7 (12.4, 127.7)||34.8 (11.8, 102.5)||0.7|
|Proportion (range) of hints accepted, percentage||N/A||24.7 (0, 91.3)||32.1 (0, 88.9)||0.2|
|For cases where hint accepted, median (range) of additional time spent on case (seconds)||N/A||6.9 (1.2, 300.0)a||6.7 (1.3, 192.6)||0.5|
The 20 items in both post-tests showed acceptable item characteristics. The range of item difficulties was good (0.28–0.97); mean (standard deviation) = 0.59 (0.20), with only one item showing ceiling effects where almost every participant answered it correctly. Cronbach's alpha measure of reliability was acceptable for a test with only 20 items at 0.68 and 0.67 for post-tests 1 and 2, respectively.
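Cronbach's alpha compares the sum of the individual item variances with the variance of the total score. A minimal Python sketch of the standard formula follows; the input layout (one row per participant, one column per item, scored 0 or 1) is an assumption made for illustration.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Internal consistency of a test.

    `scores` is a list of rows, one per participant, each holding that
    participant's score on every item (for example 1 = correct, 0 = incorrect).
    """
    k = len(scores[0])  # number of items
    item_variances = [variance(column) for column in zip(*scores)]
    total_variance = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)
```

Alpha approaches 1 when participants who answer one item correctly tend to answer the others correctly as well, and it generally increases with the number of items, which is why values near 0.7 are often regarded as acceptable for a short 20-item test.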
The 50-case learning set was successful in improving the students’ performance in correctly interpreting the radiographs. Cohen's d effect size for the control group in post-test 1 was 0.6 (95% CI 0.2, 0.9).
The students' post-test dPrime scores, according to study group, are reported in Fig. 4. In general, the weak intervention group scored the highest on both post-tests, although this difference was not statistically significant compared with either or both other groups (P = 0.4). The decreases in post-test 2 scores from post-test 1 scores were also comparable between the three groups (P = 0.2).
The multivariate model where performance (dPrime) on the post-test is predicted by the combination of the study group, the timing of the post-test and their interaction, was highly significant (χ2 = 22.9, P < 0.001), but there was neither a statistically significant main effect for the study group nor a significant interaction with the timing of the post-test. Adding the institution to the model did not change the results. In the per-protocol subgroup analysis, there was no difference in either post-test scores for the students who were above versus those who were below the median in terms of accepting hints.
Using a novel electronic hint strategy, we sought to increase the persistence with which medical students searched for abnormalities in a set of radiographs to improve their diagnostic performance. Our process measures demonstrated that when a student used the hint, they spent more time on the case and were more likely to interpret the case correctly. However, relative to the control group, the hint groups did not demonstrate greater learning on the post-test scores where no hint was provided.
Hints are meant to give away a small amount of information while still leaving the student to work actively towards the final answer.[14, 16, 23] In this study, we created and operationalised a hint strategy based on the hinting principles described by Hume et al.[16, 17] and Tsovaltzi et al.[16, 17] and the cognitive model of radiograph interpretation developed by Kundel et al. However, students were probably limited in the other necessary skills for the successful detection of radiological abnormality, and thus the hint we provided to the students was probably insufficient for the relatively novice medical student.
The hinting strategy used could be improved using a number of possible instructional manoeuvres. Per Tsovaltzi et al.'s[16, 17] and Hume et al.'s[16, 17] hint taxonomies, we could convey more information in the hint (e.g. which view to look at) or base the hint on a directed line of reasoning (e.g. suggest where to look in what sequence). In addition, although the web-based radiograph interpretation learning system with hinting incorporated recommended multimedia learning principles, including pre-training, feedback, reflection and student-controlled pacing, it lacked guided activity where the learner interacts with an electronic pedagogical agent to help guide their cognitive process. If the latter were added, it may aid in enhancing student motivation, as student engagement was probably a factor in the observed decrease in time on the case and hint use over time.
The retention post-test demonstrated a decrease in learning in the 2 week interval from the immediate post-test, and there were no significant differences in degree of skill decay between the study groups. As the hint had no significant impact on the immediate post-test scores, it was not surprising that this was also true for the retention post-test. The pattern of decline of post-test scores within 2 weeks in all groups is consistent with what has been documented in the literature, whereby skill decays over time when it is not reinforced. Two weeks has been reported as the limit for most humans in retaining detailed new information, and is the recommended timeframe for test–retest reliability testing of a measurement tool.
This research has limitations that warrant consideration. Although we randomised the students in the study, they were a non-random sample of the larger student population. Students who participated may have had an interest in new media approaches to learning, limiting the extent that these results can be generalised to all medical students. We did not collect quantitative data on the reasons for the lack of reconsideration of cases after the hint was presented, and thus are only able to speculate on the reasons why the hint groups did not perform better than controls. The immediate feedback provided by the software gave information to the participant that may have influenced responses to cases reviewed later in the session and may have influenced the student's propensity to use the hint information.
In conclusion, our study was a carefully designed investigation of one form of hinting on an existing web-based learning platform. The educational intervention was based on cognitive studies of radiology interpretation and previous research. The study was adequately powered and incorporated a test of knowledge retention. Nevertheless, the hint format used in this research was not effective in improving student learning beyond what was achieved with deliberate practice. Therefore, future studies of such learning interventions might profitably investigate design features that include hints oriented towards the recognition of specific visual features and correct decision making.
KB is the primary and corresponding author. KB primarily designed the work, oversaw all research operations in patient enrolment, data collection and entry, was involved in the analysis and the interpretation of the results and wrote most of the manuscript. MP provided intellectual and technical content in study design and execution, was the overall study coordinator, contributed to analyses, prepared all figures and critical revisions and approval of the manuscript. MS, SG and JR provided substantial acquisition of study data at their respective sites, critical revisions to the manuscript, and approval of the final version to be submitted. JA provided critical information on research design and conception, contributed to the acquisition of data, critical revisions and final approval of the manuscript. Finally, MVP is the senior author and contributed significantly to the intellectual design, grant application, research operations, performed most of the analyses and provided extensive revisions to the manuscript and approved the final version.
The authors would like to thank Dr Martin Nachbar and Dr Adina Kalet for their careful review of several drafts of the paper.
Educational development fund, University of Toronto.
The research ethics boards at all the participating institutions approved this study.