Dr. Kissin receives royalties from Gulfcoast Ultrasound for the wrist ultrasound video.
Special Theme Article: Clinical Imaging and the Rheumatic Diseases
Musculoskeletal Ultrasound Objective Structured Clinical Examination: An Assessment of the Test†
Version of Record online: 24 DEC 2013
Copyright © 2014 by the American College of Rheumatology
Arthritis Care & Research
Volume 66, Issue 1, pages 2–6, January 2014
How to Cite
Kissin, E. Y., Grayson, P. C., Cannella, A. C., DeMarco, P. J., Evangelisto, A., Goyal, J., al Haj, R., Higgs, J., Malone, D. G., Nishio, M. J., Tabechian, D. and Kaeley, G. S. (2014), Musculoskeletal Ultrasound Objective Structured Clinical Examination: An Assessment of the Test. Arthritis Care Res, 66: 2–6. doi: 10.1002/acr.22105
The view(s) expressed herein are those of the author(s) and do not reflect the official policy or position of Brooke Army Medical Center, the U.S. Army Medical Department, the U.S. Army Office of the Surgeon General, the Department of the Army, the U.S. Air Force, the Department of Defense, or the U.S. Government.
- Issue online: 24 DEC 2013
- Accepted manuscript online: 7 AUG 2013 01:24PM EST
- Manuscript Accepted: 31 JUL 2013
- Manuscript Received: 13 MAR 2013
- Clinician Scholar Educator Award from the Rheumatology Research Foundation
To determine the reliability and validity of an objective structured clinical examination (OSCE) for musculoskeletal ultrasound (MSUS).
A 9-station OSCE was administered to 35 rheumatology fellows trained in MSUS and to 3 expert faculty (controls). Participants were unaware of joint health (5 diseased/4 healthy). Faculty assessors (n = 9), blinded to the identity of the examinee, graded image quality with predefined checklists and a 5-point (1–5) global rating. Interrater reliability, correlation between a written multiple choice question examination (MCQ) and OSCE performance, and comparison of fellow OSCE results with those of the faculty were measured to determine OSCE reliability, concurrent validity, and construct validity.
Assessors' interrater reliability was good (intraclass correlation coefficient [ICC] 0.7). Score reliability was good in the normal wrist and ankle stations (ICC 0.7) and moderate in the abnormal wrist and ankle stations (ICC 0.4). MCQ grades significantly correlated with OSCE grades (r = 0.52, P < 0.01). The fellows in the bottom quartile of the MCQ scored 3.07 on the OSCE, significantly worse than the top quartile fellows (3.32) and the faculty (3.29; P < 0.01). Scores also significantly discriminated bottom quartile fellows from faculty in the normal wrist and ankle stations (3.38 versus 3.78; P < 0.01), but not in the abnormal stations (3.37 versus 3.49; P = 0.08).
MSUS OSCE is a reliable and valid method for evaluation of MSUS skill. Normal joint assessment stations are more reliable than abnormal joint assessment stations and better discriminate poorly performing fellows from faculty. Therefore, MSUS OSCE with normal joints can be used for the assessment of MSUS skill competency.
Utilization of musculoskeletal ultrasound (MSUS) has expanded greatly since its first use in 1958. In addition to radiology, many specialties now employ MSUS for point-of-care imaging, including rheumatology, physiatry, podiatry, emergency medicine, general internal medicine, and family practice, and this has led to a 316% increase in MSUS volume from 2000 to 2009. The proliferation of MSUS has elicited questions about the qualifications of physicians performing MSUS examinations. In response to these questions, certification examinations have been created by the American Registry for Diagnostic Medical Sonography and by the American College of Rheumatology. The purpose of certification should be to ensure a minimal level of competence, to stimulate professional growth, and to protect the public by encouraging quality care.
An examination of MSUS competence must be able to evaluate 2 components: skill in US image acquisition and knowledge required for US image interpretation. Although knowledge of anatomy and pathology as well as the ability to interpret US images can be tested by a multiple choice question examination (MCQ), a practical examination of scanning ability is the most direct method of assessing the skill of obtaining US images. Unfortunately, practical examination is time consuming and expensive, and the reliability and validity of this approach in MSUS have not been established, resulting in debate about whether an objective structured clinical examination (OSCE) should be part of MSUS competency testing.
Our research group has developed a training program for rheumatology fellows that includes online educational resources, remote online image review by rheumatology faculty with expertise in MSUS, and an educational workshop that includes 21 hours of didactic lectures and hands-on scanning of patients and cadaveric joints. Over the course of 8 months, fellows are encouraged to submit 50 comprehensive US studies for faculty review and feedback. Upon completion, fellows travel to a final examination, including an MCQ and an OSCE.
The purpose of this study was to determine the reliability and validity of an OSCE for MSUS. This was done by assessing interrater reliability for practical examination grading, concurrent validity by comparing OSCE performance with performance on a written examination in MSUS, and construct validity by comparing trainee/fellow OSCE scores with faculty OSCE scores.
Box 1. Significance & Innovations
- This is the first study to assess the reliability and validity of an objective structured clinical examination (OSCE) assessment for musculoskeletal ultrasound.
- This study showed higher reliability and discriminant validity for OSCE stations using normal joints compared to diseased joints.
- This study showed that remote, blinded assessment of OSCE performance is reliable and valid, potentially decreasing the costs associated with organizing an OSCE examination for students of musculoskeletal ultrasound.
MATERIALS AND METHODS
Setting and participants
Thirty-five rheumatology fellows who participated in an 8-month training program in MSUS underwent an examination consisting of a 76-question MCQ and a 9-station OSCE. The MCQ was developed by the program faculty, many of whom are experienced test question writers. All faculty members submitted questions based on prespecified examination content areas. The faculty then reviewed each question as a group and either retained or eliminated it based on relevance, clarity, and difficulty. Nine rheumatology faculty members with expertise in MSUS (mean of 6 years of experience) served as proctors for the OSCE, as assessors/graders of the practical examination, and as “gold standard” control participants in the OSCE. Faculty were trained in standardized practical proctoring and grading during an hour-long seminar a few weeks prior to the OSCE, and the testing procedures were reviewed during a 30-minute meeting immediately before the OSCE. Each of 4 healthy volunteers and 5 volunteers with rheumatic joint disease, recruited from a rheumatology outpatient clinic, had 1 joint examined at an OSCE station. The participants were unaware of whether the joint to be examined was abnormal (one of each of the wrist, ankle, elbow, finger, and toe) or normal (one of each of the wrist, ankle, knee, and shoulder). In the abnormal stations, the following pathology was represented: gouty arthritis, synovitis, erosive arthritis (n = 2), and enthesitis. The protocol for this study was exempted by the Institutional Review Board at Boston University School of Medicine.
The participants were expected to perform standardized, comprehensive scans for each joint area tested. The required images, including specific anatomic structures, were the same as those required in the preceding curriculum. At each station, a faculty proctor witnessed and graded the studies being performed using a predefined checklist. Participants were aware of the items on the checklist and the rating system. Each predefined checklist element on the scoring sheet was graded on a 5-point rating scale, where 1 = failing, 2 = borderline pass, 3 = average, 4 = above average, and 5 = “publication quality.” The checklist items included proper adjustment of machine settings, transducer orientation and alignment, as well as artifact-free visualization of the tendons and bone surfaces in each of the required views. Additionally, the graders were asked to provide a global rating score for each participant as an overall assessment of performance at each station on the same 5-point scale. Separately, each resulting US image was graded by 2 faculty assessors blinded to examinee identification using the same scoring sheet. It is important to note that the image grading was based strictly on image characteristics, not on whether the participant identified pathology when this was present. Each volunteer was also scanned by 3 faculty members, and the resulting images were graded along with the trainee images by the blinded faculty assessors.
The borderline group method was used to set the overall OSCE passing score. This method consisted of identifying participants who scored a 2 (borderline pass) on the global rating scale for a station, calculating a mean composite score from the predefined checklist elements for that station for each participant who scored a 2 globally, and averaging the composite scores for all such participants. This composite mean score then served as a passing score for the station, and a mean of the passing scores for the 9 individual stations served as a passing score for the practical examination as a whole ([5, 6]). The Angoff method was used to determine the MCQ pass score.
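The borderline group calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the record layout (a global rating paired with a list of checklist scores), the function names, and the ratings are all hypothetical.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    values = list(values)
    return sum(values) / len(values)


def station_pass_score(records, borderline_rating=2):
    """Borderline group cutoff for one station.

    records: list of (global_rating, checklist_scores) tuples, one per
    participant. This layout is an assumption made for illustration.
    """
    # Composite score = mean of the checklist items, computed only for
    # participants whose global rating was "borderline pass".
    composites = [mean(items) for rating, items in records
                  if rating == borderline_rating]
    if not composites:
        raise ValueError("no borderline participants at this station")
    return mean(composites)


def exam_pass_score(stations):
    """Mean of the per-station cutoffs, used as the overall passing score."""
    return mean(station_pass_score(records) for records in stations)


# Made-up ratings for two stations (not study data).
station_a = [(2, [2, 3, 4]), (4, [4, 5, 4]), (2, [2, 2, 2])]
station_b = [(2, [3, 3, 3]), (3, [4, 3, 4])]
print(exam_pass_score([station_a, station_b]))
```

A station with no borderline participants has no defined cutoff under this method, so the sketch raises an error rather than guessing one.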
Participant scores from the individual checklist elements were averaged for each station. A composite OSCE score was derived for each participant by averaging the participant's mean station scores across all stations. Interrater reliability for assessors and proctors was estimated using the intraclass correlation coefficient (ICC). The mean of the composite OSCE scores was compared between proctors and assessors using a paired t-test. To assess for potential differences in how normal versus abnormal joint stations were graded, the ICC was used to estimate the reliability of assessor scores separately for the normal and for the abnormal wrist and ankle stations. To assess for potential redundancy of multiple stations, interstation correlation was calculated from each participant's mean station scores using Pearson's correlation coefficient.
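As an illustration of how an ICC summarizes agreement between two graders, the sketch below computes a one-way random-effects, single-rater ICC from a small table of scores. The article does not state which ICC form was used, so the choice of this form is an assumption, and the function name and scores are hypothetical.

```python
def icc_oneway(ratings):
    """One-way random-effects, single-rater ICC, often written ICC(1,1).

    ratings: list of rows, one row per graded subject (for example, one
    participant's image set), each row holding the k scores given by
    the k raters. Assumes at least 2 subjects and 2 raters per subject.
    """
    n = len(ratings)
    k = len(ratings[0])
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]

    # Between-subject and within-subject mean squares from a one-way ANOVA.
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    ms_within = sum((x - m) ** 2
                    for row, m in zip(ratings, row_means)
                    for x in row) / (n * (k - 1))

    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Made-up composite scores from two assessors for four participants.
scores = [[3.0, 3.2], [3.6, 3.4], [2.8, 3.0], [4.0, 3.8]]
print(round(icc_oneway(scores), 2))
```

Values near 1 indicate that most of the score variance comes from differences between participants rather than from disagreement between raters.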
Concurrent validity was established by correlating MCQ scores and composite OSCE scores (proctor scores; averaged assessor scores) using Pearson's correlation coefficient. The distribution of MCQ scores was compared between the participants who failed and passed the OSCE, as determined by the borderline group method, using Wilcoxon's rank sum test.
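The correlation step can be illustrated with a short, dependency-free Python sketch of Pearson's correlation coefficient. The study itself used SAS; the function name and the paired scores below are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations about the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up MCQ percentages and composite OSCE scores (not study data).
mcq = [55, 62, 68, 71, 75, 80]
osce = [3.0, 3.1, 3.3, 3.2, 3.4, 3.5]
print(round(pearson_r(mcq, osce), 2))
```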
Construct validity was established by dividing MCQ scores into quartiles and comparing the trainee/fellow OSCE composite scores in the lowest MCQ quartile to trainee/fellow OSCE scores in the highest quartile and to faculty OSCE scores (gold standard) using Wilcoxon's rank sum test. All calculations were done using SAS, version 9.3. A P value less than 0.05 was used to define statistical significance.
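The quartile comparison can be sketched as follows. This is an illustration under stated assumptions: the quartile cut points are taken by sorted position (the article does not describe how ties at quartile boundaries were handled), and the rank sum test uses a normal approximation without tie or continuity correction, which may differ from the SAS procedure used in the study. All names and data are hypothetical.

```python
import math


def split_quartiles(mcq_by_id):
    """Sort participant ids by MCQ score and cut into 4 near-equal groups."""
    ordered = sorted(mcq_by_id, key=mcq_by_id.get)
    n = len(ordered)
    bounds = [(q * n) // 4 for q in range(5)]
    return [ordered[bounds[q]:bounds[q + 1]] for q in range(4)]


def midranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = average_rank
        i = j + 1
    return ranks


def rank_sum_test(a, b):
    """Wilcoxon rank sum test; returns (U for group a, z, two-sided P)."""
    n1, n2 = len(a), len(b)
    ranks = midranks(list(a) + list(b))
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    z = (u - n1 * n2 / 2) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    p = math.erfc(abs(z) / math.sqrt(2))  # normal approximation
    return u, z, p


# Made-up MCQ and OSCE scores for 8 hypothetical fellows.
mcq = {"f1": 52, "f2": 58, "f3": 64, "f4": 69,
       "f5": 72, "f6": 74, "f7": 80, "f8": 86}
osce = {"f1": 2.9, "f2": 3.0, "f3": 3.1, "f4": 3.2,
        "f5": 3.2, "f6": 3.3, "f7": 3.4, "f8": 3.5}
quartiles = split_quartiles(mcq)
bottom = [osce[i] for i in quartiles[0]]
top = [osce[i] for i in quartiles[3]]
print(rank_sum_test(bottom, top))
```

With groups as small as those in a single quartile, an exact test would usually be preferable to the normal approximation shown here.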
RESULTS
Borderline methodology resulted in an OSCE pass score of 3.0, while the Angoff method resulted in an MCQ pass score of 62. Five fellows received a failing score on the OSCE from both examination assessors. Nine fellows received a failing score on the MCQ portion of the examination.
Interrater reliability for OSCE grading was good (ICC 0.7) between the assessors, but was poor (ICC 0.3) between the assessors and the proctors. The proctors consistently gave a higher OSCE station score than the average of the 2 blinded assessors (3.6 versus 3.2; P < 0.0001).
Reliability of the assessor scores was good in the normal/healthy wrist and ankle stations (ICC 0.7) and moderate in the abnormal wrist and ankle stations (ICC 0.4). The mean interstation correlation (comparison of station scores within each participant) was low (r = 0.16, range −0.15 to 0.57).
MCQ scores significantly correlated with composite OSCE scores from both of the assessors (r = 0.52, P < 0.01) (Figure 1) and from the proctors (r = 0.58, P < 0.01). The mean MCQ score for the 5 fellows who failed the OSCE was less than that for the 30 who passed (60% versus 71%; P = 0.04).
The fellows in the bottom quartile of the MCQ scored 3.07 on the OSCE, which is significantly lower than the top quartile fellows (3.32) and the faculty (3.29; P < 0.01 for both groups). Composite OSCE scores also significantly discriminated bottom quartile fellows from faculty in the normal wrist and ankle stations (3.38 versus 3.78; P < 0.01), but not in the abnormal stations (3.37 versus 3.49; P = 0.08) (Figure 2). Top MCQ quartile fellows outperformed the bottom quartile MCQ fellows in the abnormal stations (3.78 versus 3.37; P = 0.01).
No fellows received an OSCE failing score from the proctors. Five fellows (14%) received an OSCE failing score from both blinded assessors. Thirteen fellows (37%) received a failing score from only 1 blinded assessor. All 5 fellows who failed the OSCE by grades from both blinded assessors also scored in the bottom 2 quartiles on the MCQ. Conversely, 13 of the fellows who were in the bottom 2 quartiles on the MCQ passed the OSCE. Three fellows failed both the OSCE and MCQ portions of the examination, whereas 8 fellows failed 1 of the 2 sections (Table 1).
| OSCE result | MCQ quartile 1 (49–62%) | MCQ quartile 2 (63–70%) | MCQ quartile 3 (71–75%) | MCQ quartile 4 (76–89%) |
|---|---|---|---|---|
| OSCE fail (<3.0) | 3 | 2 | 0 | 0 |
| OSCE pass (≥3.0) | 6 | 7 | 9 | 8 |
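The counts in the table can be cross-checked against the figures quoted in the text with a few lines of arithmetic. The sketch below only re-derives totals from the table cells; the function and key names are hypothetical.

```python
def summarize(fail_counts, pass_counts):
    """Derive totals from OSCE fail/pass counts listed by MCQ quartile."""
    total = sum(fail_counts) + sum(pass_counts)
    failed = sum(fail_counts)
    return {
        "total_fellows": total,
        "osce_failed": failed,
        "osce_failed_pct": round(100 * failed / total),
        # Fellows in the bottom 2 MCQ quartiles who still passed the OSCE.
        "bottom_half_mcq_passed_osce": pass_counts[0] + pass_counts[1],
        # OSCE failures among fellows in the top 2 MCQ quartiles.
        "top_half_mcq_failed_osce": fail_counts[2] + fail_counts[3],
    }


# Cell values copied from the table, quartile 1 through quartile 4.
summary = summarize(fail_counts=[3, 2, 0, 0], pass_counts=[6, 7, 9, 8])
for key, value in summary.items():
    print(key, value)
```

These totals agree with the text: 35 fellows overall, 5 (14%) failing the OSCE by both blinded assessors, all of them in the bottom 2 MCQ quartiles, and 13 fellows in the bottom 2 quartiles passing the OSCE.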
DISCUSSION
We found the 9-station MSUS OSCE to be a reliable and valid method for evaluation of MSUS skill. Both concurrent and construct validity were suggested by blinded OSCE image assessment successfully discriminating poorly performing fellows on the MCQ from well-performing fellows and faculty members. Given the high-stakes nature of a certification examination and the disagreement we found between assessors in giving a failing practical examination grade (37% of fellows failed by one assessor versus 14% failed by both assessors), it seems reasonable to require that at least 2 assessors grade the examination images. Remote, blinded grading of the OSCE images, as done in this study, increases result reliability by allowing more than one assessor to grade each station, while increasing feasibility by limiting cost. Furthermore, our finding that 11 fellows failed either the practical or the written examination while 3 fellows failed both examinations implies that a certification examination that does not include both components may lead to certification of some practitioners who can identify pathology on a test question image, but do not have sufficient skill to obtain the necessary images themselves. Other investigators have found similar discrepancies between pass rates of OSCE and written examinations.
Low interstation correlation suggests that each station evaluated relatively independent skills, and that decreasing the number of stations may substantially impact the reliability of the OSCE. Similar findings regarding the minimal number of OSCE stations needed for a high-stakes examination have been reported previously.
Although proctors were able to see the “live” performance of the examination as well as the resulting images, this did not increase the validity of their grades in comparison to images graded “blindly.” In addition, witnessing the examination in progress led to grade “inflation.” The proctors saw the actual US being performed, and this alone may have inflated grades: they witnessed the target anatomic structures being imaged, even when those structures were not captured (“frozen”) optimally. Clerkship directors in internal medicine have reported difficulty delivering negative feedback as the top explanation for grade inflation. Therefore, giving a poor grade to a trainee in the same room may be more difficult than giving one to an anonymous trainee whom one is not facing. In addition to assigning a grade, proctors were tasked with labeling and storing images, making machine adjustments as requested by the examinees, and ensuring all paperwork was in order during the test. This fatiguing, multitasking environment under the time constraints of a “live” examination is much different from the quiet, self-paced study of images by the blinded assessors.
Replacing proctor grading with remote image grading could also decrease the cost of a practical MSUS examination and increase the availability of qualified faculty for practical grading, thus making a practical examination more reliable and more feasible. These factors should be balanced against the additional information available to a proctor but not to a blinded assessor, such as correct patient positioning, attention to patient comfort, and thoroughness of the scan (appropriately scanning all the way through a structure versus only scanning to make one perfect image).
The findings that normal joint assessment stations were more reliable than abnormal joint assessment stations and better discriminated poorly performing fellows from faculty also impact feasibility, since recruiting patients with rheumatic disease for a day-long practical examination is substantially more challenging than recruiting healthy volunteers. The use of patients with rheumatic diseases would also increase yearly test variability.
The reasons for the decreased grading reliability and decreased discriminant utility of OSCE stations with diseased joints are not clear. Efforts to limit variability in OSCE grading included the use of predefined checklists and training the faculty in standardized grading. Anatomic structures in a normal joint are straightforward and easier to quantitatively assess using predefined checklists. However, this distinction is less clear in a diseased joint where anatomic structures of interest may not be as easily demonstrated in a single image. Assessors may not be certain about the optimal appearance of abnormal joints for grading purposes, and therefore may have more difficulty scoring the resulting images in concordance. In addition, more competent sonographers may find and record the best representation of pathology in a single image as opposed to the target structures on the checklist, which may result in lower scores based on the grading paradigm. This variability could have been minimized by using consensus between assessors when scoring discrepancies were apparent.
Our comparison of normal and abnormal stations is also limited by the number of stations overall, and the results may be impacted by the specific joints and pathology evaluated during the examination. While we tried to present pathology typical for a rheumatology practice, representing a complete array of rheumatologic pathology on a practical examination is not feasible. Therefore, it is more practical to test knowledge of pathology on an MCQ style of examination, and it is even more important to include a broad array of pathology on the MCQ if stations with pathology are not part of the practical examination. In any case, it is possible that our result might have been changed by using patient volunteers with different pathology.
MSUS competency assessment is an important aspect of training, since the achievement of higher levels of competence can result in better-quality care and lower cost utilization ([12-15]). Since competence in MSUS depends not only on image interpretation but also on image acquisition, practical examination through OSCE could be used to supplement the MCQ to ensure that trainees who pass the examination have not only the knowledge necessary to interpret US images, but also the skill to obtain adequate images for interpretation. Finally, learners are motivated to acquire knowledge and skills to meet the challenge of whatever testing format is anticipated. Our trainees were likely motivated to improve their image acquisition skills in preparation for the OSCE. OSCE testing of a separate group who prepared only for an MCQ format would be required to test this hypothesis.
This study examined the critical factors of reliability and validity required for the use of an OSCE as part of MSUS certification. Our findings suggest that a valid MSUS OSCE can be performed utilizing normal anatomy with remote grading by blinded experts. Additional studies are necessary to confirm our findings of these key test characteristics. In addition, further research on the cost of OSCE implementation in a certification examination is needed.
All authors were involved in drafting the article or revising it critically for important intellectual content, and all authors approved the final version to be published. Dr. Kissin had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.
Study conception and design. Kissin, DeMarco, Higgs, Nishio, Kaeley.
Acquisition of data. Kissin, Cannella, DeMarco, Evangelisto, Goyal, al Haj, Higgs, Malone, Nishio, Tabechian, Kaeley.
Analysis and interpretation of data. Kissin, Grayson, al Haj, Kaeley.
- 12. Impact of certified CME in atrial fibrillation on administrative claims. Am J Manag Care 2012;18:253–60.