Comparison of student examiner to faculty examiner scoring and feedback in an OSCE
Geneviève Moineau, University of Ottawa, Faculty of Medicine, 2038A-451 Smyth Road, Ottawa, Ontario K1H 8M5, Canada. Tel: 00 1 613 562 5800/8561; Fax: 00 1 613 562 5651; E-mail: email@example.com
Medical Education 2011: 45: 183–191
Objectives To help reduce pressure on faculty staff, medical students have been used as raters in objective structured clinical examinations (OSCEs). There are few studies regarding their ability to complete checklists and global rating scales, and a paucity of data on their ability to provide feedback to junior colleagues. The objectives of this study were: (i) to compare expert faculty examiner (FE) and student-examiner (SE) assessment of students’ (candidates’) performances on a formative OSCE; (ii) to assess SE feedback provided to candidates, and (iii) to seek opinion regarding acceptability from all participants.
Methods Year 2 medical students (candidates, n = 66) participated in a nine-station formative OSCE. Year 4 students (n = 27) acted as SEs and teaching doctors (n = 27) served as FEs. In each station, SEs and FEs independently scored the candidates using checklists and global rating scales. The SEs provided feedback to candidates after each encounter. The FEs evaluated SEs on the feedback provided using a standardised rating scale (1 = strongly disagree, 5 = strongly agree) for several categories, according to whether the feedback was: balanced; specific; accurate; appropriate; professional, and similar to feedback the FE would have provided. All participants completed questionnaires exploring perceptions and acceptability.
Results There was a high correlation on the checklist items between raters on each station, ranging from 0.56 to 0.86. Correlations on the global rating for each station ranged from 0.23 to 0.78. Faculty examiners rated SE feedback highly, with mean scores ranging from 4.02 to 4.44 for all categories. There was a high degree of acceptability on the part of candidates and examiners.
Conclusions Student-examiners appear to be a viable alternative to FEs in a formative OSCE in terms of their ability to both complete checklists and provide feedback.
The objective structured clinical examination (OSCE) is widely used to evaluate the performance of medical students for both formative and summative purposes. Although OSCE scores can be a reliable and valid measure of an examinee’s clinical skills, the OSCE puts a high demand on doctor resources.1 Doctors often serve as examiners as they are considered ‘expert’ judges and many medical schools use doctors to provide feedback to candidates during the examination process. Difficulty in recruiting doctor-examiners has led to the use of alternative raters, such as standardised patients (SPs).2–4 Another potential alternative to doctors, and the one of interest to us, is the use of student-examiners (SEs).
Medical students as evaluators of other medical students have been studied in various contexts in medical school. Medical students have been shown to be similar to faculty staff in rating lectures and written papers on the history of medicine.5,6 In the clinical setting, Arnold et al.7 found that peer ratings during an internal medicine rotation were internally consistent, unbiased and valid. In the OSCE setting, Chenot et al. evaluated student-tutors’ ability to perform as OSCE examiners on history-taking stations. Student-tutors gave slightly better average grades than teaching doctors (differences of 0.02–0.20 on a 5-point Likert scale). Inter-rater agreement at the stations ranged from 0.41 to 0.64 for checklist assessment and global ratings; overall inter-rater agreement on the final grade was 0.66. Examinees (64%) felt they would receive the same grades from student and faculty examiners and 85% felt that SEs would be as objective as faculty examiners (FEs).8 Ogden et al.9 compared dental students with dental faculty staff in the assessment of medical students on a mouth examination station during an OSCE and found that dental students were equivalent to dentists as examiners on this physical examination station.
Studies also support the use of medical students to provide feedback. Reiter et al.10 studied the feedback provided to medical students by their peers during an OSCE. Upon leaving the station at which they had been examined and given feedback by a student, a resident or a faculty member, examinees were asked to complete a questionnaire on the quality of feedback received. The quality of feedback from the SEs was deemed to be superior to that received from resident or faculty examiners. Examinees judged SEs to be acceptable. A possible confounder in the study by Reiter et al.10 is that examinees received their marks immediately prior to rating the feedback. The authors report that statistically significantly higher marks were given by the SEs, which may have biased the results.10
Given the potential cost benefits of using SEs, it would be of interest to study their use in a medical student OSCE testing both history and physical examination skills and to review their ability to provide feedback. In addition, a more rigorous evaluation of SE feedback through direct observation by faculty members would add significantly to the existing literature. The purpose of this study was to compare expert FE and SE assessment of students’ performance on a formative OSCE. Although we are interested in comparing summative assessments between rater types, the main focus of this study was to compare scores on a formative assessment. Accordingly, the feedback provided by SEs to the candidates was also examined and opinions regarding the acceptability of using SEs were gathered from all participants.
A formative nine-station OSCE was administered to Year 2 medical students. All Year 2 and Year 4 medical students were invited to participate. Of 152 Year 2 students, the first 66 who volunteered to participate as examinees (candidates) were selected. They had never participated in an OSCE and received a 20-minute candidate orientation prior to the examination. The first 27 Year 4 student volunteers were selected to be SEs. They had participated as candidates in three prior OSCEs during medical school. Faculty examiners were all experienced clinicians who were involved in teaching and had examined in at least one prior compulsory OSCE. Before starting the OSCE, the FEs and SEs were given a 30-minute orientation to the research OSCE, as well as information on how to give feedback.
The OSCE consisted of nine stations testing clinical skills, including history-taking, communication and physical examination skills. In each station, an SE and FE simultaneously viewed the encounter and independently completed a checklist and a global rating scale. The global rating scale used a 6-point scale ranging from 1 = inferior to 6 = excellent, on which 3 = borderline unsatisfactory and 4 = borderline satisfactory. Although the global rating is designed to capture the overall quality of the examinee’s performance, its sole use is to establish a cut score, and therefore the rating is not considered when deriving a score for a station. The standard-setting procedure used was the modified borderline group method.11 It defines the cut score for each station as the mean checklist score for all the candidates identified as borderline. Each station was 8 minutes in length and was followed by 2 minutes of feedback provided by the SE to the candidate. In addition to scoring the candidates, FEs also evaluated SEs on the verbal feedback they gave to each candidate using a standardised evaluation form. The evaluation form for feedback was developed based on work by Ende12 and adapted for use in the OSCE. The form had been piloted in a previous OSCE. Students were able to complete the form in the allocated time and did not feel it negatively impacted on their performance (data not included). The evaluation form used a 5-point rating scale and evaluated different aspects of the feedback, such as whether it was balanced, specific, accurate and appropriate. The FE was also asked to rate the feedback overall and to indicate whether he or she would have given similar feedback and whether he or she had needed to intervene, and to assess SE professionalism. No additional feedback was given to the candidate by the FE unless the FE considered it important to intervene.
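The cut-score rule described above can be illustrated with a minimal sketch. It assumes that "borderline" covers both borderline points on the 6-point scale (3 and 4); the function name and the data are hypothetical and are not taken from the study.

```python
from statistics import mean

# Assumption: both borderline points on the 6-point global rating scale count
# (3 = borderline unsatisfactory, 4 = borderline satisfactory).
BORDERLINE_RATINGS = {3, 4}


def borderline_cut_score(checklist_scores, global_ratings):
    """Return the mean checklist score of candidates rated borderline at one station."""
    borderline = [
        score
        for score, rating in zip(checklist_scores, global_ratings)
        if rating in BORDERLINE_RATINGS
    ]
    if not borderline:
        raise ValueError("no borderline candidates at this station")
    return mean(borderline)


# Hypothetical station data: five candidates, checklist scores out of 10.
scores = [5.0, 6.0, 7.0, 8.0, 9.0]
ratings = [3, 4, 5, 6, 2]
cut_score = borderline_cut_score(scores, ratings)  # mean of 5.0 and 6.0
```

In this sketch the global rating only selects which candidates enter the average and never contributes to a candidate's own station score, which matches the role the text gives it.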
To accommodate all students, three identical tracks were run simultaneously for two administrations. The SEs remained stationary for the entire examination and candidates rotated through the stations. The FEs changed tracks at the halfway point in the examination to avoid having the same SE and FE pair for the entire examination. All candidates completed pre- and post-OSCE questionnaires and SEs and FEs completed post-OSCE questionnaires. The questionnaires explored participants’ opinions on whether the process was acceptable and their perceptions of it. All responses used a 5-point rating scale on which 1 = strongly disagree and 5 = strongly agree. Checklist station scores were created by converting each candidate’s total checklist score on that station to a proportion. The proportions were multiplied by 10 for reporting purposes.
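As a rough illustration of the score conversion described above, the sketch below turns a raw checklist total into a proportion and rescales it to a 0–10 range. The function and parameter names are hypothetical, and the example values are invented.

```python
def station_score(points_earned, points_possible):
    """Convert a raw checklist total to a proportion, then scale it to 0-10."""
    if points_possible <= 0:
        raise ValueError("points_possible must be positive")
    if not 0 <= points_earned <= points_possible:
        raise ValueError("points_earned must lie between 0 and points_possible")
    proportion = points_earned / points_possible
    return 10 * proportion


# Hypothetical candidate: 13 of 20 checklist points credited.
example_score = station_score(13, 20)
```

Expressing every station on the same 0–10 scale allows stations with different checklist lengths to be averaged and compared directly.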
Data analysis included comparing correlation coefficients for SE and FE checklist station scores and global ratings, as well as using paired t-tests to compare mean scores assigned by the examiners. Although the global rating was not used for scoring, we decided to consider it in the analysis because other research has found differences between non-doctor-examiners and doctor-examiners for rating scales, but not for checklists.2 The global rating was analysed in a similar fashion to the checklist scores. The reliability of each scoring instrument was also determined using a generalisability analysis. Mean scores and frequencies or percentages of response in each of the categories on the questionnaires were explored.
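The two comparisons named above (a correlation between rater types and a paired comparison of their mean scores) might be computed as in the following sketch. It uses only the Python standard library; the function names and scores are hypothetical, and the source does not state which software the authors used.

```python
from math import sqrt
from statistics import mean, stdev


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def paired_t(x, y):
    """Paired t statistic and degrees of freedom for matched scores."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


# Hypothetical checklist scores for the same five candidates at one station.
se_scores = [6.0, 7.0, 8.0, 7.5, 6.5]  # student-examiner
fe_scores = [5.5, 7.0, 7.5, 7.0, 6.5]  # faculty examiner

r = pearson_r(se_scores, fe_scores)
t, df = paired_t(se_scores, fe_scores)
```

For two matched conditions, the square of the paired t statistic equals the F statistic of a repeated-measures analysis with 1 and n - 1 degrees of freedom, so a result reported as F can be read as the same paired comparison.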
Comparing scores and global ratings
Table 1 displays the mean scores, standard deviations and correlations between SEs and FEs for checklist and global rating scales. The mean total checklist score for SEs (6.91) was higher than the mean total score for FEs (6.78) (F(1,65) = 9.22, p = 0.003, d = 0.37). As shown in the table, the mean checklist score given by SEs on each station also tended to be higher than the corresponding checklist score provided by FEs. The mean global ratings of SEs and FEs did not differ (means = 4.81 and 4.82, respectively; F < 1, d = 0.04). Of more importance, Table 1 also displays the correlations between the checklist scores and global ratings provided by each rater type. The correlation between rater types was high for the mean total checklist score (r = 0.81) and ranged from moderate to high (r = 0.56 to r = 0.86) for scores on individual stations. The correlation between rater types was lower for the mean global rating (r = 0.68) and ranged from 0.23 to 0.78 for ratings on individual stations.
Table 1. Mean scores, standard deviations and correlations between student-examiners (SEs) and faculty examiners (FEs) for checklist and global rating scales (HX = history-taking station; PE = physical examination station)
|Station||SE checklist, mean (SD)||FE checklist, mean (SD)||r||SE global rating, mean (SD)||FE global rating, mean (SD)||r|
|1 HX||6.45 (1.07)||6.65 (1.02)||0.66||4.55 (0.93)||4.69 (0.90)||0.61|
|2 PE||6.99 (1.33)||6.91 (1.53)||0.86||4.89 (0.95)||4.95 (0.85)||0.57|
|3 HX||6.48 (0.93)||6.13 (0.98)||0.56||4.79 (0.69)||4.90 (0.82)||0.24|
|4 PE||7.08 (1.41)||7.03 (1.23)||0.83||5.02 (0.94)||5.09 (0.84)||0.78|
|5 PE||6.30 (1.39)||7.06 (1.30)||0.69||4.74 (0.88)||4.75 (0.87)||0.59|
|6 HX||7.59 (1.15)||7.26 (1.30)||0.81||4.92 (0.75)||4.48 (1.40)||0.23|
|7 PE||8.69 (1.16)||8.07 (1.59)||0.67||4.92 (1.04)||5.24 (0.79)||0.68|
|8 PE||6.12 (1.28)||5.59 (1.54)||0.74||4.68 (0.84)||4.49 (1.00)||0.52|
|9 HX||6.47 (1.23)||6.33 (1.20)||0.84||4.74 (0.85)||4.73 (0.83)||0.44|
|Overall mean||6.91 (0.52)||6.78 (0.57)||0.81||4.81 (0.40)||4.82 (0.46)||0.68|
A generalisability analysis was conducted to determine the reliability of the two measures. For the checklist, a repeated-measures analysis of variance (ANOVA) was conducted with stations (1–9) treated as a repeated-measures factor and rater type (SE, FE) nested within each station. An identical analysis was conducted for the global rating scale. Table 2 summarises the variance components for the checklist and global rating scale. Note that for both analyses, variance attributable to SPs could not be included because SP was confounded with station. As the table shows, candidates accounted for very little of the variance in scores for either measure, indicating that the cohort was probably relatively homogeneous. For both measures, the candidate × station interaction accounted for a large amount of the variability, which indicates that candidates’ scores varied from station to station. By contrast, the rater nested within station facet did not account for a large amount of variability, which indicates that the scores given by the two raters were relatively similar for each station. It is interesting to note that the amount of unexplained variance was higher for the global rating than it was for the checklist.
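Table 2. Facets, variance components, percentage variance and standard errors for each rater type
|Facet||Checklist variance component||Checklist % variance||Checklist standard error||Global rating variance component||Global rating % variance||Global rating standard error|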
|Raters (r): s||0.08||4||0.04||0.02||2||0.01|
To determine the reliability of the two examination measures, the variance components displayed in Table 2 were used to generate g-coefficients. G-coefficients were 0.48 for the checklist and 0.60 for the global rating, indicating that the global rating scale produced a more reliable measure of performance than the checklist, although both are moderate.
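One common way to turn variance components into a g-coefficient for a design in which candidates are crossed with stations and raters are nested within stations is sketched below. This is the standard relative (norm-referenced) formula and is an assumption here: the source does not report which formula or which numbers of stations and raters were used, and the variance components in the example are invented.

```python
def g_coefficient(var_candidate, var_candidate_station, var_residual,
                  n_stations, n_raters_per_station):
    """Relative g-coefficient for a candidate x (rater : station) design.

    var_candidate          -- variance attributable to candidates (true-score variance)
    var_candidate_station  -- candidate x station interaction variance
    var_residual           -- candidate x rater-within-station variance, confounded with error
    """
    relative_error = (
        var_candidate_station / n_stations
        + var_residual / (n_stations * n_raters_per_station)
    )
    return var_candidate / (var_candidate + relative_error)


# Hypothetical variance components for a nine-station examination, one rater per station.
g = g_coefficient(
    var_candidate=0.2,
    var_candidate_station=0.9,
    var_residual=0.9,
    n_stations=9,
    n_raters_per_station=1,
)  # approximately 0.5 with these invented values
```

The formula shows why a small candidate variance limits reliability: when the cohort is homogeneous, the numerator is small relative to the error terms, even if raters agree closely.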
Table 3 summarises the mean scores for each of the categories on the feedback questionnaire. The mean total scores ranged from 4.02 to 4.44 (out of 5; 4 = agree, 5 = strongly agree). This indicates that the feedback provided by the SEs was rated highly by the FEs. The actual number and percentage (%) of ratings scored at < 3 (3 = neutral) were small for each category: professionalism, n = 0 (0%); balanced feedback, n = 6 (9.1%); specific feedback, n = 3 (4.5%); accurate feedback, n = 5 (7.6%); appropriate manner, n = 0 (0%); good feedback, n = 1 (1.5%), and similar feedback, n = 7 (10.6%).
Table 3. Student-examiner feedback: mean scores of faculty examiners’ assessment of feedback given by student-examiners to candidates*
|Statement||Mean score|
|The student-examiner was professional in attitude and behaviour||4.71|
|The feedback given was balanced||4.41|
|The feedback given was sufficiently specific||4.35|
|The feedback was accurate||4.35|
|The feedback was given in an appropriate manner||4.54|
|Overall, the feedback given was very good||4.42|
|I would have given similar feedback||4.15|
Faculty examiner post-OSCE questionnaire
Table 4 displays the descriptive statistics for the FEs’ assessment of SE performance. Faculty examiners agreed that SEs appeared comfortable evaluating their peers on history-taking skills, communication skills and physical examination skills. They disagreed that SEs had higher expectations than themselves or gave harsher feedback. There was agreement that students learn from being examiners. When asked whether the OSCE could be delivered by SEs without faculty members present, FEs agreed only slightly. They disagreed that there was any tension between students that would be detrimental to the feedback or evaluation process. Although requested, no specific comments were received. Faculty examiners agreed that an SE OSCE could be used in a formative setting, but were overall neutral regarding such an examination in a summative setting.
Table 4. Student-examiner performance and perceptions
|Item||Minimum||Maximum||Mean||SD|
|Faculty examiner assessment of SE performance|
|SE appeared comfortable evaluating his or her peers on history-taking skills||4||5||4.47||0.51|
|SE appeared comfortable evaluating his or her peers on communication skills||1||5||4.09||0.95|
|SE appeared comfortable evaluating his or her peers on physical examination skills||3||5||4.21||0.71|
|SE had higher expectations of student performance than FE||1||4||2.26||0.76|
|SE gave harsher feedback than FE would have given||1||3||1.85||0.60|
|SE learned from being an examiner (seeing another student’s approach to a station, reading the checklists, etc.)||3||5||4.26||0.76|
|There was tension between certain student combinations that would be detrimental to the feedback or evaluation process||1||2||1.59||0.50|
|This could occur without faculty members' presence||1||5||3.74||0.98|
|SEs should continue to be used for formative OSCEs||1||5||4.11||1.01|
|SEs should continue to be used for summative OSCEs||1||5||2.85||1.03|
|Student-examiner perceptions|
|I felt comfortable evaluating my peers on history-taking skills||3||5||4.52||0.68|
|I felt comfortable evaluating my peers on communication skills||4||5||4.54||0.51|
|I felt comfortable evaluating my peers on physical examination skills||3||5||4.52||0.60|
|The evaluation grid gave enough information to evaluate my peers effectively||2||5||4.19||0.92|
|The training given on how to give feedback was helpful||2||5||3.44||1.09|
|I felt comfortable giving feedback to my peers||2||5||4.48||0.70|
|I would feel comfortable evaluating the students without faculty members' presence||3||5||4.44||0.58|
|I found this to be a useful learning experience||3||5||4.44||0.70|
|There was tension between myself and certain student-examinees that would be detrimental to the feedback or evaluation process||1||4||1.37||0.84|
|A peer evaluation OSCE would be acceptable if it was formative (i.e. for practice only)||2||5||4.52||0.75|
|A peer evaluation OSCE would be acceptable for summative purposes (i.e. scores count toward final grade)||2||5||3.78||1.01|
Student-examiner post-OSCE questionnaire
Table 4 shows that SEs agreed strongly that they were comfortable in examining candidates in history-taking, communication and physical examination skills. They agreed that the evaluation grids provided them with enough information to evaluate the candidates, but were neutral regarding the helpfulness of the training on giving feedback. Despite this, SEs were comfortable in giving feedback to their peers and would be comfortable evaluating students without faculty member presence. Most considered this OSCE to represent a useful learning experience and very few thought that any tension – which would be detrimental to the feedback and evaluation process – existed between themselves and the candidates. The SEs agreed strongly that an SE OSCE would be acceptable as a formative process, but only slightly that it would be so as a summative process.
Candidate pre- and post-OSCE questionnaires
Candidate perceptions regarding the SE evaluation and feedback process were more favourable after the OSCE than before it, as shown in Table 5. All indicators in the candidate questionnaire improved post-OSCE, although none of the differences between pre- and post-OSCE responses reached statistical significance except where indicated. Post-OSCE, candidates agreed strongly that they were comfortable being evaluated by more senior students on history-taking and physical examination skills. Candidates strongly agreed that an SE OSCE would be acceptable as a formative assessment and were neutral regarding its acceptability as a summative assessment. If given the option, 95% of students would prefer to be examined by Year 4 students rather than Year 3 students. After the OSCE, candidates strongly agreed that the feedback they had received had been constructive and that it had been given in an appropriate manner. Candidates agreed that SE OSCEs should continue and disagreed that tension existed between candidates and SEs. When asked if they preferred an FE to an SE, candidates were neutral.
Table 5. Candidates’ pre- and post-OSCE questionnaires
|Item||Pre-OSCE mean||Pre-OSCE SD||Post-OSCE mean||Post-OSCE SD|
|I felt comfortable being evaluated by an SE on my history-taking skills||3.94||0.94||4.68||0.59|
|I felt comfortable being evaluated by an SE on my communication skills||4.00||0.96||4.66||0.59|
|I felt comfortable being evaluated by an SE on my physical examination skills||3.95||0.89||4.68||0.56|
|A peer OSCE would be acceptable for formative purposes (i.e. for practice only)||4.42||0.90||4.61||0.91|
|A peer OSCE would be acceptable for summative purposes (i.e. scores count toward final grade)||3.32||1.08||3.98||1.02|
|I received constructive feedback||–||–||4.73||0.51|
|The feedback I received was given to me in an appropriate manner||–||–||4.88||0.33|
|I would prefer to have a faculty examiner instead of a senior medical student||–||–||2.86||1.15|
|There was tension between myself and a certain SE that would be detrimental to the feedback or evaluation process||–||–||1.89||1.24|
|Peer evaluation OSCEs are worthwhile||–||–||4.43||0.66|
Pre-OSCE candidate questionnaire comments
Pre-OSCE, candidates’ comments about their concerns fell under three main themes. The most common theme (39 comments) was the possibility that the candidate would know the SE and that this might cause embarrassment or change their relationship. The next theme concerned whether or not the SE had the knowledge to evaluate adequately (10 comments). The last theme related to whether SEs would mark more severely than FEs (six comments). Pre-OSCE, candidates perceived the benefits to be evaluation by an empathetic student (25 comments), an opportunity to practise (22 comments) and an opportunity to gain feedback (six comments).
Post-OSCE candidate questionnaire comments
Post-OSCE, 37 candidates commented on whether the experience had been worthwhile. Of these, 33 comments were positive and focused on how good the SEs had been in giving feedback and how the OSCE had represented a great learning opportunity. Three comments specifically mentioned the possibility of conflict if the candidate knew the SE well. The only negative comment gave the opinion that SEs sometimes did not pick up on everything that was done and indicated that the candidate would have preferred to have an FE who had more experience.
This study was designed to compare SE and FE assessment of candidate performances on an OSCE, to assess SE feedback and to seek opinions regarding SE acceptability. This appears to be the first published study of a comprehensive medical student OSCE in which FEs directly observed medical student-examiners.
Student-examiners appeared to be capable of assessing candidate performances, with moderate to high correlations between rater types noted. That said, correlations between examiner types were higher for checklist scores than for the global rating scale. This pattern of findings is similar to that found in other studies. For example, in a study comparing trained assessors with doctor-examiners, Humphrey-Murto et al.2 found a high correlation between pairs of examiners on a checklist, but poor agreement in terms of pass/fail standings that were calculated using a global rating identical to that used in our study. Similarly, Rothman and Cusimano13 found lower agreement between SPs and doctor-examiners when examinee interview skills were scored using a rating scale. One possible explanation for this finding is that when doctor-examiners use the rating scale, they may either give the benefit of doubt to an examinee because they realise what the examinee is trying to do, or they may penalise an examinee because something was not done well. A non-doctor-examiner would not have the medical knowledge to interpret what the examinee is doing and so would not be able to either give credit or penalise the examinee. With a checklist, an examinee either does or does not do something and thus there is less potential for the doctor-examiner to credit an action. It is also possible that global ratings are less objective when communication skills are highlighted. This is certainly one of the conclusions drawn by Rothman and Cusimano.13 In our study, correlations for two of the stations in particular were lower. Both stations were history-taking stations; station 3 involved a history for dysmenorrhoea and station 6 involved a history for back pain. Station 1 was a psychiatry history-taking station, but did not have a lower correlation between raters; thus it is not clear to what degree communication skills modulate the correlation between SEs and FEs on a rating scale.
Feedback provided by SEs was rated highly by FEs. Faculty examiners found the feedback given by the SEs to be balanced, specific, accurate, appropriate, and given in a professional manner. The feedback was rated as being similar to the feedback the FEs themselves would have given. This appears to be the first study to provide faculty members' assessment of SE feedback from direct observation.
This student-examiner OSCE was well received by all participants. Candidates, SEs and FEs thought that a formative OSCE could be administered without faculty members present. In addition, both SEs and FEs agreed that being an examiner is a good learning experience for the student and that more training on how to give feedback would be helpful. Candidates’ comfort in having a student evaluate their skills increased significantly (from agree to strongly agree) after the OSCE had occurred. Overall, FEs, SEs and candidates found the experience positive and worthwhile. Calhoun et al.14 and Rees et al.15 also reported that students considered evaluation by students to be valuable and informative and that it stimulated them to learn more and practise their skills. Martin et al.3 and Reiter et al.10 both suggest that an SE OSCE should be a formative event. All three groups in our study agreed that an SE OSCE should be a formative event only.
The limitations of this study include the fact that it was a single-site study and thus its findings may not be generalisable to other locations and assessment methods. All participants were volunteers. From a candidate perspective, this may have generated a relatively highly performing homogeneous group. It is unclear whether feedback would have differed with a more heterogeneous pool of examinees and whether it would have led to a larger proportion of failures. Our SEs were volunteers and thus were a pre-selected, motivated group of students. It is also unclear if the same level of quality feedback and professionalism would exist if faculty staff had not been in the room directly observing the encounter.
In summary, SEs appear to be a viable alternative to FEs in a formative OSCE in terms of their abilities to both complete checklists and provide feedback.
Contributors: All authors made substantial contributions to the conception and design of the study, and the acquisition of data. BP, A-MJP and SH-M designed the objective structured clinical examination by creating the questions and the checklists. TJW and SH-M analysed and interpreted the data. All authors contributed to the drafting and critical revision of this article, and approved the final manuscript for publication.
Conflicts of interest: none.
Ethical approval: this study was approved by the Ottawa Hospital Research Ethics Board, Ottawa, Ontario, Canada.