Objective structured clinical examination (OSCE) is a key part of medical student assessment. Currently, assessment is performed by medical examiners in situ. Our objective was to determine whether assessment by videotaped OSCE is as reliable as live OSCE assessment.
Participants were 95 undergraduate medical students attending their musculoskeletal week at Freeman Hospital, Newcastle (UK). Student performance on OSCE stations for shoulder or knee examinations was assessed by experienced rheumatologists. The stations were also videotaped and scored by a rheumatologist independently. The examinations consisted of a 14-item checklist and a global rating scale (GRS).
Mean values for the shoulder OSCE checklist were 17.9 by live assessment and 17.4 by video (n = 50), and 20.9 and 20.0 for live and video knee assessment, respectively (n = 45). Intraclass correlation coefficients for shoulder and knee checklists were 0.55 and 0.58, respectively, indicating moderate reliability between live and video scores for the OSCE checklists. GRS scores were less reliable than checklist scores. There was 84% agreement in the classification of examination grades between live and video checklist scores for the shoulder and 87% agreement for the knee (κ = 0.43 and 0.51, respectively; P < 0.001).
Video OSCE has the potential to be reliable and offers some advantages over live OSCE including more efficient use of examiners' time, increased fairness, and better monitoring of standards across various schools/sites. However, further work is needed to support our findings and to implement and evaluate the quality assurance issues identified in this work before justifiable recommendations can be made.
Objective structured clinical examinations (OSCEs), first introduced by Harden and Gleeson in 1979 (1), are widely used in the assessment of undergraduate and postgraduate medical students and are regarded as offering better validity than traditional long-case final examinations (2). In a long-case examination, the student sees a patient alone for 30–60 minutes, obtains a history, and performs a physical examination. The student is then questioned about the findings, relevant investigations, and further treatment of the patient. The processes of history taking and examination are not observed. Therefore, communication and examination skills may not be adequately assessed (3). The advantages of OSCEs include greater reliability (they provide a consistent challenge for all candidates assessed [4]) and greater face and content validity (5) because the process as well as the outcome can be assessed. In addition, OSCEs allow sampling of a greater range of skills than the long-case examination. They have also been shown to correlate better with consultant rating of the candidate (2) than traditional clinical assessments based on long- and short-case examinations. However, OSCEs require a great deal of organization, not least the coordination of a large number of clinicians to be in the same place at the same time for a single examination. Not only must these clinicians be in one place, they also should have undergone some training in the assessment to maximize reliability. Furthermore, long OSCE assessment sessions can affect the objectivity of the assessors due to fatigue.
Videotapes have been used for a number of years for a variety of purposes within medical education. They are perceived as effective learning resources in the field of communication skills (6, 7) and have been used in the learning of skills for self and tutor assessment (8–11). Lane and Gottlieb (8) found that use of videotaping improved students' interviewing skills and self assessment and had the advantage of identifying students who overrated themselves. Videos have also been used in the evaluation of educational interventions (12). They have been used for evaluating performance and competency (13) as well as rater bias (14). Videos have been used in the assessment of communication skills (15, 16) and in the assessment of general practice trainees' consultation skills in the UK since the 1980s and have been found to be effective, valid, and reliable (17).
Successful implementation of videotaped OSCEs (VOSCEs) would offer considerable potential advantages to faculty, examiners, and candidates. The first advantage is in terms of quality control. Videotaped OSCE stations offer the potential for establishing consensus between examiners, for investigating interexaminer variability, and even for comparing standards between medical schools. It is possible to increase the objectivity of assessment by having assessors evaluate examination skills based on agreed standardized marking criteria. The second advantage is in terms of practicality. Running an OSCE for a group of students is very time consuming and requires expensive clinical expertise and coordination. Videotaping the student performance and marking the performance at a later point means the OSCE can be run with relatively few, if any, clinicians present because stations do not necessarily have to be manned by clinical assessors. Therefore, the examination process may be perceived to be more efficient and reliable: the cost and stress involved in organizing the OSCE might be reduced while improving the consistency and fairness of assessments.
Evidence that VOSCEs are a practical, valid method of implementing OSCEs has not been established in the field of musculoskeletal medicine and is little explored in other fields of physical examination. In this study, we carried out formative OSCE assessments of third-year undergraduate medical students performing shoulder and/or knee examination as part of a larger educational randomized controlled trial (18) and videotaped these OSCEs. We present results of an investigation of the relationship between the live examiner's assessment and that of a video assessor, and we discuss the practicalities of videotaping musculoskeletal examination OSCE stations.
PARTICIPANTS AND METHODS
This study was performed alongside a randomized controlled trial evaluating the educational value of a computer-assisted learning program, Virtual Rheumatology CD (Newfangled Media, Stoke on Kent, UK), in the teaching of musculoskeletal clinical examination skills in undergraduate medical students (18). The study took place at the University of Newcastle upon Tyne, Newcastle, UK.
Participants were a subgroup of subjects who took part in the randomized controlled trial and included third-year undergraduate medical students attending their musculoskeletal week during a 12-week clinical skills module at Freeman Hospital, Newcastle between January 2002 and June 2003. Prior to the start of placement, these students had attended a 1-week clinical skills block, which included teaching of musculoskeletal examination.
The OSCE consisted of a station on knee examination and a station on shoulder examination. Participants in this study were examined on one station only. Each station was 6 minutes long. There was a 14-item checklist for scoring the OSCE. Students did not have access to this checklist prior to the examination. For each item, a score of 0, 1, or 2 was given for “not done,” “done,” and “done well,” respectively. Scores of individual items were summed for each station, resulting in total scores for the OSCE shoulder and knee assessments based on discrete numerical scales ranging from 0 (“not done” recorded on all items) to 28 (“done well” recorded on all items). In addition, we added a global rating scale (GRS) as a supplementary measure (a 10-cm visual analog scale ranging from 0 = poor to 100 = excellent) to the OSCE score sheets. GRS have been shown to be valid measures for the assessment of clinical skills (19).
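As an illustration of the scoring rule just described, the following minimal Python sketch sums the 14 item scores into a station total. The function and constant names are hypothetical and are not taken from the study, and the input validation is an assumption added for safety.

```python
# Hypothetical helper illustrating the checklist scoring described above.
# Each of the 14 items is scored 0 ("not done"), 1 ("done"), or 2 ("done well").
ITEM_LABELS = {0: "not done", 1: "done", 2: "done well"}
N_ITEMS = 14


def total_checklist_score(item_scores):
    """Sum the 14 item scores into a station total between 0 and 28."""
    if len(item_scores) != N_ITEMS:
        raise ValueError(f"expected {N_ITEMS} item scores, got {len(item_scores)}")
    for score in item_scores:
        if score not in ITEM_LABELS:
            raise ValueError(f"invalid item score: {score!r}")
    return sum(item_scores)
```

A student scored "done well" on every item would receive 28, and one scored "not done" on every item would receive 0, matching the range stated in the text.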
Video recording of the OSCEs.
Digital video cameras were attached to tripods and placed in the room where the examination took place. Student performance was videotaped with consent. The recording was done on mini digital videotapes and was converted to VHS tapes for ease of scoring.
Local rheumatology specialist registrars (SpR; qualified physicians undertaking specialist postgraduate training in rheumatology, equivalent to residents in the US) volunteered for the OSCE assessment of students. It was not feasible to train the registrars especially for this study but all had prior experience in administering and scoring OSCEs. We had 2 main assessors for the knee station and the shoulder station. However, due to clinical commitments, other SpR had to stand in for our regular assessors. Overall, 4 raters were involved in assessing the knee and/or shoulder stations. A consultant rheumatologist (ABH), who was blind to the live scores of the OSCE assessment, scored the VOSCE for both knee and shoulder stations.
On day 4 of the week's rotation, students were asked to volunteer for a formative OSCE examination. Students were randomly allocated to 1 of 2 OSCE stations and at the end of the assessment were given verbal and written feedback on their performance. To preserve participant confidentiality, each student was assigned an anonymized code, which protected his or her name and identity from the VOSCE assessor. Approval was obtained from respective chairs of the university ethics committee. Written consent was obtained from the students for video recording of the OSCE and for using the data from the OSCE for the research study.
A sample size between 40 and 50 participants was required for each reliability analysis in order to calculate confidence intervals to the precision of ±0.2 on either side of the reliability coefficients.
The association between live and video OSCE checklist and GRS scores was assessed in a number of ways. Mean difference and 95% limits of agreement were calculated. Consistency of scoring between the measures was assessed using Pearson's correlation. Absolute agreement was determined using the intraclass correlation coefficient (ICC) using a 2-way random effects model (ICC2,1). OSCE checklist scores were classified according to the traditional examination grading system of fail (score <14, i.e., <50%), pass (score 14–20, i.e., 50–74%), and honors (score ≥21, i.e., ≥75%), and reliability between live and video grades was evaluated using observed agreements and the chance-corrected weighted kappa statistic (using linear weights). In addition to evaluating total scores, we also looked at the reliability of each of the individual items of the 2 OSCE stations using observed agreements and weighted kappa.
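The grading and chance-corrected agreement steps above can be sketched in Python as follows. This is a minimal illustration that assumes the standard form of Cohen's weighted kappa with linear weights; the function names are hypothetical, no confidence intervals are computed, and it is not the software used in the study.

```python
# Hypothetical sketch: grade bands from the text plus linear weighted kappa.
GRADES = ("fail", "pass", "honors")


def grade(score):
    """Map a 0-28 checklist total to fail (<14), pass (14-20), or honors (>=21)."""
    if score < 14:
        return "fail"
    if score <= 20:
        return "pass"
    return "honors"


def linear_weighted_kappa(ratings_a, ratings_b, categories=GRADES):
    """Cohen's kappa with linear agreement weights for two raters."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(ratings_a)

    # Joint table of proportions: rows are rater A, columns are rater B.
    table = [[0.0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        table[index[a]][index[b]] += 1.0 / n
    row = [sum(table[i]) for i in range(k)]
    col = [sum(table[i][j] for i in range(k)) for j in range(k)]

    def weight(i, j):
        # Linear weights: 1 for exact agreement, 0 for maximal disagreement.
        return 1.0 - abs(i - j) / (k - 1)

    observed = sum(weight(i, j) * table[i][j] for i in range(k) for j in range(k))
    expected = sum(weight(i, j) * row[i] * col[j] for i in range(k) for j in range(k))
    if expected == 1.0:
        raise ValueError("kappa is undefined when expected agreement is 1")
    return (observed - expected) / (1.0 - expected)
```

For example, `linear_weighted_kappa([grade(s) for s in live], [grade(s) for s in video])` would compare two hypothetical lists of checklist totals, `live` and `video`, after grading.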
Fleiss demonstrated that the ICC was closely related to the weighted kappa (20), and recommended that an ICC value <0.4 was poor, between 0.4 and 0.75 was fair to good, and >0.75 was excellent (21). We adopted the similar and widely accepted classification according to Landis and Koch (22) to provide adjectives to describe the reliability values for the ICC and kappa calculated in this study: 0.01–0.20 indicated slight, 0.21–0.40 indicated fair, 0.41–0.60 indicated moderate, 0.61–0.80 indicated substantial, and 0.81–1.00 indicated almost perfect.
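The adjective bands listed above can be expressed as a small lookup. This Python sketch uses a hypothetical function name and makes two assumptions not stated in the text: each band's upper limit is treated as inclusive, and values at or below 0 (which fall outside the listed bands) return `None`.

```python
# Hypothetical lookup for the reliability adjectives quoted in the text.
# Upper limits are treated as inclusive (an assumption for values such as 0.205).
_BANDS = (
    (0.20, "slight"),
    (0.40, "fair"),
    (0.60, "moderate"),
    (0.80, "substantial"),
    (1.00, "almost perfect"),
)


def landis_koch_label(value):
    """Return the adjective for an ICC or kappa value, or None if value <= 0."""
    if value > 1.0:
        raise ValueError("reliability coefficients cannot exceed 1")
    if value <= 0.0:
        # The bands quoted in the text start at 0.01, so no label is assigned.
        return None
    for upper, label in _BANDS:
        if value <= upper:
            return label
    return None
```

Applied to the checklist ICCs reported in this study (0.55 and 0.58), the lookup returns "moderate".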
A random subsample of participant videos was rescored after 3 months by the same consultant (ABH). The intrarater agreement of the OSCE checklist scores was evaluated by ICC and by kappa (after classifying the scores into grades [fail, pass, honors] as described above).
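The absolute-agreement ICC used for both the live versus video and the test–retest comparisons can be sketched as below. This assumes that ICC2,1 refers to the Shrout and Fleiss two-way random effects, single-rater form computed from ANOVA mean squares; the function name is hypothetical and the sketch is not the study's actual analysis code.

```python
# Hypothetical sketch of ICC(2,1): two-way random effects, absolute agreement,
# single rater, computed from the usual two-way ANOVA mean squares.
def icc_2_1(scores):
    """scores: list of rows, one row per subject and one column per rater."""
    n = len(scores)
    k = len(scores[0]) if n else 0
    if n < 2 or k < 2 or any(len(r) != k for r in scores):
        raise ValueError("need at least 2 subjects and 2 raters, rectangular data")

    grand = sum(sum(r) for r in scores) / (n * k)
    row_means = [sum(r) / k for r in scores]
    col_means = [sum(r[j] for r in scores) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_total = sum((x - grand) ** 2 for r in scores for x in r)
    ss_err = ss_total - ss_rows - ss_cols                    # residual

    bms = ss_rows / (n - 1)
    jms = ss_cols / (k - 1)
    ems = ss_err / ((n - 1) * (k - 1))

    denom = bms + (k - 1) * ems + k * (jms - ems) / n
    if denom == 0:
        raise ValueError("ICC is undefined for constant data")
    return (bms - ems) / denom
```

Because this form measures absolute agreement, a rater who is consistently a fixed number of points lower than another reduces the ICC even when the two sets of scores are perfectly correlated. For the illustrative data `[[10, 12], [20, 22], [15, 17]]` the function returns 50/54, roughly 0.93, rather than 1.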
The results are based on 50 matched pairs of observations for the shoulder OSCE and 45 for the knee. Of the 4 assessors, one (CM) scored 39 (78%) participants, one (DC) scored 7 (14%), and another (MF) scored 4 (8%) at the shoulder station; one assessor (DC) scored 26 (58%) participants, one (MB) scored 15 (33%), and one (MF) scored 4 (9%) at the knee station. The subgroup of individuals who were assessed by VOSCE in addition to the OSCE for this study had similar baseline characteristics to individuals who were assessed by OSCE but not VOSCE, e.g., 66% and 69% were women, respectively; mean OSCE shoulder scores were 19.0 and 18.5, respectively; and mean OSCE knee scores were 21.1 and 21.0, respectively.
Paired data for the live and video scores are illustrated in Figure 1. Live and video summary scores on the OSCE checklist were very similar (Table 1). Mean values for the OSCE checklist shoulder scores were 17.9 by live assessment and 17.4 by video assessment, and for the knee scores were 20.9 and 20.0, respectively. By contrast, GRS scores were lower for the video assessment than the live assessment. Pearson's correlation coefficients between the live and videotaped assessments for the OSCE checklist and GRS scores ranged from 0.46 to 0.66 (Table 2). The ICCs indicated moderate reliability between video and live scores, with values of 0.55 and 0.58 for the OSCE checklist of the shoulder and knee, respectively. The reliability was only fair between scores for the global ratings. The video examiner consistently scored candidates lower than did the live examiner on the GRS score, but not on the checklist score (Figure 1).
Table 1. Mean ± SD scores for live and video objective structured clinical examination assessments
Assessment | Shoulder composite scale (n = 50) | Shoulder global scale (n = 38) | Knee composite scale (n = 45) | Knee global scale (n = 31)
Live | 17.9 ± 3.4 | 76.0 ± 11.3 | 20.9 ± 2.5 | 73.0 ± 11.5
Video | 17.4 ± 3.4 | 59.0 ± 16.1 | 20.0 ± 2.8 | 60.5 ± 15.8
Table 2. Reliability of scoring of the objective structured clinical examination for live versus video assessments*
Data comparing the live and video ratings of the individual items of the shoulder assessment and the knee assessment are presented in Tables 3 and 4, respectively. Large variations in reliability were seen across the items in both shoulder and knee OSCE stations. Substantial reliability (κ > 0.6) for shoulder OSCE was seen for the items “performs resisted movement,” “inspects active neck movements,” “assesses the acromioclavicular joint,” “asked patient to put hands behind head and hands behind back,” and “external rotation of the shoulder with the elbows tucked in.” Similarly, substantial reliability (κ > 0.6) for knee OSCE was observed for “assesses temperature,” and moderate reliability (κ = 0.41–0.60) was observed for “assesses full extension,” “assesses full flexion,” “palpates joint line,” “patella tap,” and “undertakes active and passive movements.”
Table 3. Agreement between live and video ratings for individual items of the objective structured clinical examination (OSCE) shoulder assessment*
Based on linear weights. Agreement expected by chance alone; the kappa coefficient measures the chance-corrected agreement ([observed agreement − expected agreement]/[1 − expected agreement]).
Knee assessment item (Table 4) | Weighted κ (confidence interval)
Approach to the patient (including asking about knee pain) | −0.06 (−0.32, 0.21)
Inspection (including from the end of the bed) | 0.18 (−0.10, 0.46)
Assessment of temperature | 0.68 (0.43, 0.94)
Assessment of muscle bulk | 0.18 (0.00, 0.37)
Palpation of patella | 0.27 (0.07, 0.47)
Palpate joint line (including the back of the knee) | 0.50 (0.26, 0.74)
Patella tap ± cross fluctuation | 0.44 (0.21, 0.66)
Assess full extension | 0.52 (0.32, 0.73)
Assess full flexion | 0.45 (0.24, 0.66)
Collateral ligament assessment at 15 degrees | 0.19 (−0.07, 0.45)
Undertakes active and passive movements | 0.44 (0.22, 0.66)
Anterior draw test | −0.03 (−0.31, 0.25)
Gets patient to walk | 0.05 (−0.06, 0.16)
Identifies normality/abnormalities correctly | −0.12 (−0.32, 0.07)
Reliability was moderate (κ = 0.41–0.60) for the overall grades of the shoulder and knee OSCE assessments (Table 5). This could be further improved by considering the omission or modification of poorer agreement items (see Tables 3 and 4) within the OSCE checklists.
Table 5. Agreement between live and video ratings for graded classification of the objective structured clinical examination (OSCE) checklist*
We rescored 22 video OSCEs to evaluate the intrarater agreement of the VOSCE. The test–retest included 11 shoulder checklists and 11 knee checklists. The scores were pooled so that the reliability analysis was based on 22 pairs of scores. The ICC was 0.98 (95% confidence interval 0.96, 0.99) and the kappa value based on graded classification of scores was 1.00 (100% agreement).
Our goal was to investigate the relationship between the assessments of live and videotaped OSCE stations. Our results demonstrated moderate interrater reliability between the live scorer(s) and the video scorer for both the knee station and the shoulder station using a checklist scoring approach. The interrater reliability using a GRS was lower: the live (SpR) scorer consistently scored the students higher than did the (consultant) video scorer, indicating examiner bias. Poor interrater reliability for the GRS may reflect different expectations on the part of a strict consultant compared with the lenient SpR. Additionally, it is possible that the live examiner forms more of a relationship with the candidate and therefore tends to give them higher scores.
Reliability between live and video assessments ranged from moderate to almost perfect for 16 of the 28 individual items of the OSCE checklist. We were not able to distinguish which of the 2 methods of assessment, live or video, was more accurate because there was no gold standard to compare against. The poorer agreement of the remaining 12 items may be viewed from clinical and statistical perspectives. The shoulder items “identifies bony landmarks” and “assess for painful arc” had poor reliability. Identification of bony landmarks can be particularly difficult to score on a video if the focus is not close enough and/or the students do not name the landmarks or explain what they are doing. The video assessor scored students as having done well only if students gave a verbal description of what they were palpating during the procedure. This highlights one of the key areas where differences may occur. In the event of any uncertainty regarding any aspect of the student clinical examination, live assessors can seek further clarification from the students. This is not possible via prospective scoring by video, and underlines a limitation of the video method of assessment. Similarly, the items “inspection” and “assessing muscle bulk” from the knee station are also difficult to score via video unless students describe what they are doing. There was only slight reliability for “collateral ligament assessment at 15 degrees” from the knee checklist, which may be explained in part by the fact that this examination requires complex movements and handling skills. Hoving et al (23) found that movements that were complex and required handling skills had poorer interrater reliability than movements that were simple.
The item “approach to the patient” scored very poorly at both knee and shoulder stations in terms of agreement. The video assessor gave a score of 2 only if the student both introduced themselves and specifically asked the patient about pain prior to examining the patient, whereas the live assessors did not appear to have the same criteria for scoring this question. This finding raises another generic point concerning checklist marking: to maximize reliability, the checklist must make explicit how marks are awarded. The item “gets patient to walk” was also scored very differently between the live and video examiners. Scores by the video examiner were most frequently recorded as 0 (“not done”) whereas the live examiners most frequently scored this item as “done well,” suggesting that there were quite different scoring criteria adopted for the 2 approaches, the criteria for the video scoring being more strict. If a student was instructed or prompted, the video examiner did not award a full mark. The item “identifies normality/abnormalities correctly” from the knee station had only slight reliability, although there was better agreement for this item in relation to the shoulder station. Because this item draws from the other checklist items, the poor agreement for this item may be due to poor concordance within other items.
It should be noted that poor reliability may also be inferred from results based on inadequate statistical measurement. In the context of this study, less than moderate reliability was concluded for some items (specifically “approach to the patient,” “inspection,” “anterior draw test”) when the expected (or chance) agreements of the items were high. As the expected agreement increases, the kappa becomes increasingly limited in its capacity to yield meaningful reliability values (24, 25). If for a specified item a certain category has a high likelihood of being scored by all raters, then the expected interrater agreement for that item will be high. For example, both the live rater and the video rater most frequently scored the items “inspection” and “anterior draw test” as having been done well because both items were relatively easy examinations for the students to perform. As a result, the expected agreements of the 2 items were 84% and 97%, respectively (i.e., close to 100%), leaving little room for measuring agreement above that expected by chance alone.
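The effect of high expected agreement on kappa can be seen by applying the chance-correction formula directly. In this Python sketch the expected agreement of 97% is the value reported above for the “anterior draw test”; the observed agreements and the 50% comparison value are hypothetical, chosen only to show the arithmetic, and the function name is an assumption.

```python
# Hypothetical illustration of kappa = (observed - expected) / (1 - expected).
def kappa_from_agreement(observed, expected):
    """Chance-corrected agreement from observed and expected proportions."""
    if not (0.0 <= observed <= 1.0 and 0.0 <= expected <= 1.0):
        raise ValueError("agreements must be proportions between 0 and 1")
    if expected == 1.0:
        raise ValueError("kappa is undefined when expected agreement is 1")
    return (observed - expected) / (1.0 - expected)


# Same hypothetical observed agreement (98%), two different chance levels.
high_chance = kappa_from_agreement(0.98, 0.97)  # expected agreement from the text
low_chance = kappa_from_agreement(0.98, 0.50)   # hypothetical comparison value

print(round(high_chance, 2))  # 0.33
print(round(low_chance, 2))   # 0.96
```

With an expected agreement of 97%, raters who agree on 98% of students obtain a kappa of only about 0.33, and raters who agree on 97% obtain a kappa of 0, even though the raw agreement is very high in both cases.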
No gold standard exists to establish the content validity of a musculoskeletal examination OSCE station. However, Coady et al (26) have derived a core set of clinical skills relevant to musculoskeletal examination skills in students. Of the 22 core skills relevant to the examination of the shoulder and knee joint from the Regional Examination of Musculoskeletal System (REMS) for undergraduate medical students (26), our OSCEs included 19 skills. The skills not addressed by our tool include assessing leg length when leg length discrepancy is suspected and when appropriate, assessing neurologic and vascular systems during the assessment of a problematic joint, and making a qualitative assessment of movement.
There is published evidence that examiners' clinical experience has an impact on interexaminer agreement on the palpatory diagnosis in osteopathy (27). In this study, the level of agreement between live and video examiners might have been stronger had their level of clinical experience been closer. Unlike the study by Branch and Lipsky (28), which measured the impact of an educational intervention on retention, confidence, and ability of musculoskeletal examination skills of medical students, ours is an exploratory study. Some aspects of this study could be improved. The key area is the lack of face-to-face preassessment discussion between all the assessors on how to score each of the items. This was not possible for several pragmatic reasons. The video assessor was geographically too far away from the live assessors. Owing to the busy schedule of the clinical placement, the formative OSCE assessments were offered as an optional addition during the lunch hour and the volunteer assessors had little time to discuss scoring criteria before the assessment.
OSCE assessment via video is a very attractive proposition in the current climate of increasing pressure for clinicians to take on the role of teachers and assessors. It may also provide a higher level of consistency between institutions and pave the way for better quality assurance, such as anonymized marking to increase fairness, the ability of all students to go through the stations in the same order, and the ability of the faculty to monitor standards in assessment across various hospital sites as well as across schools. Moreover, interrater reliability of live scorers has been shown to vary from 0.25 at some stations to 0.77 at others (29), providing evidence that the consistency between live assessors is not much different from the reliability between live and video assessments in this study. To improve the reliability of video or live assessment, it is important to improve the process of assessment (for example, by standardizing methods of evaluation, scoring, and administration). Our test–retest results, albeit based on a subsample of our original study population, suggest that reproducibility of video scoring is likely to be almost perfect, which further implies that the overall reliability of video scoring by different observers and its reliability against live scoring would probably be increased by standardizing the methods of evaluation and trying to establish scoring consensus between different assessors.
There are a number of other key pragmatic issues that need to be taken into account when designing an OSCE station that is to be videotaped. It is important that the necessary equipment and expertise are available so that good quality recordings can be obtained. We discarded one videotaped examination as not scorable due to poor positioning of the video camera and therefore poor recording. In this study, we used only 1 video camera to assess the student examining the patient's joint. An alternative method would be to use 2 cameras simultaneously, where one camera could record the student examining and the other could focus on the joint being inspected. The latter may give better visual information to the video assessor and may improve the reliability of assessment of items that involve visual inspection. However, this would have to be weighed against the probable increase in duration of assessment by video. Although studies in other specialties (largely communication skills) demonstrate that the use of videotaping can be a valuable learning experience for students to improve their skills, not all students may be comfortable with being recorded. We did not explore students' views in this study. It is also known that student performance may be influenced by the videotaping process (30). Offering a certain period of adaptation time before the formal assessment phase begins so that the students have a chance to familiarize themselves with the environment may minimize this effect. In contrast, it could be argued that students may express different levels of anxiety about performing clinical examinations in front of a live examiner. It also remains to be seen if VOSCEs are suitable for other specialties in medicine.
Further work is needed to establish the potential for the VOSCE in the assessment of clinical examination skills. The reliability of video scoring after standardized scoring methods have been put in place should be established; our work on intraobserver variability suggests reliabilities will be considerably enhanced. There is room for further investigation of how procedures including the set up process and quality of equipment can improve the integrity of scoring videotaped OSCE assessments. We have not yet addressed the views of examiners and students regarding videotaped assessment. Finally, there is considerable opportunity for investigating whether VOSCE assessments are valid across different clinical specialties.
In conclusion, VOSCEs have the potential of improving quality assurance and saving resources. In practice they need to be conducted with care, taking into account practical issues of camera and patient placement as well as the principles of effective assessment, with good examiner training to ensure consistency of scoring. Finally, this study highlights the potential of VOSCE stations in examiner training. We cannot conclude that videotaped scoring is better than live scoring of OSCE assessments, but our findings do suggest that VOSCE may be an efficient and reliable alternative to traditional live scoring.
Dr. Vivekananda-Schmidt had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.
Study design. Vivekananda-Schmidt, Lewis, Coady, Hassell.
Acquisition of data. Vivekananda-Schmidt, Coady, Morley, Kay, Walker.
Analysis and interpretation of data. Vivekananda-Schmidt, Lewis, Hassell.