Intermethod Reliability of Real-time Versus Delayed Videotaped Evaluation of a High-fidelity Medical Simulation Septic Shock Scenario

Authors

  • Justin B. Williams MD
  • Marie A. McDonough MD
  • Michael W. Hilliard MD
  • Annette L. Williams MD
  • Peter C. Cuniowski MD
  • Michael G. Gonzalez MD

From the Department of Emergency Medicine, San Antonio Military Medical Center–North (JBW, MWH, PCC), San Antonio Uniformed Services Health Education Consortium Emergency Medicine Residency (MAM); and the Department of Emergency Medicine, San Antonio Military Medical Center–South (ALW, MGG), San Antonio Uniformed Services Health Education Consortium Emergency Medicine Residency, Fort Sam Houston, TX.

  • Presented at the Society for Academic Emergency Medicine annual meeting, Washington, DC, May 2008.

  • The conclusions and opinions reported by the authors do not necessarily reflect the official position of the U.S. Department of Defense or the United States Army.

Address for correspondence and reprints: Justin B. Williams, MD; e-mail: justin.williams2@amedd.army.mil.

Abstract

Objectives:  High-fidelity medical simulation (HFMS) is increasingly utilized in resident education and evaluation. No criterion standard for assessing performance currently exists. This study compared the intermethod reliability of real-time versus videotaped evaluation of HFMS participant performance.

Methods:  Twenty-five emergency medicine residents and one transitional resident participated in a septic shock HFMS scenario. Four evaluators assessed the performance of participants on technical (26-item yes/no completion) and nontechnical (seven-item, five-point Likert scale) scorecards. Two evaluators provided assessment in real time, and two provided delayed videotape review. After 13 scenarios, evaluators crossed over and completed the scenarios in the opposite method. Real-time evaluations were completed immediately at the end of the simulation; videotape reviewers were allowed to review the scenarios with no time limit. Agreement between raters was tested using the intraclass correlation coefficient (ICC), with Cronbach’s alpha used to measure consistency among items on the scored checklists.

Results:  Bland-Altman plot analysis of both conditions revealed substantial agreement between the real-time and videotaped review scores by reviewers. The mean difference between the reviewers was 0.0 (95% confidence interval [CI] = –3.7 to 3.6) on the technical evaluation and –1.6 (95% CI = –11.4 to 8.2) on the nontechnical scorecard assessment. Comparison of evaluations for the videotape technical scorecard demonstrated a Cronbach’s alpha of 0.914, with an ICC of 0.842 (95% CI = 0.679 to 0.926), and the real-time technical scorecard demonstrated a Cronbach’s alpha of 0.899, with an ICC of 0.817 (95% CI = 0.633 to 0.914), demonstrating excellent intermethod reliability. Comparison of evaluations for the videotape nontechnical scorecard demonstrated a Cronbach’s alpha of 0.888, with an ICC of 0.798 (95% CI = 0.600 to 0.904), and the real-time nontechnical scorecard demonstrated a Cronbach’s alpha of 0.833, with an ICC of 0.714 (95% CI = 0.457 to 0.861), demonstrating substantial interrater reliability. The raters were consistent in agreement on performance within each level of training, as the analysis of variance demonstrated no significant differences between evaluators on the technical scorecard (p = 0.176) or the nontechnical scorecard (p = 0.367).

Conclusions:  Real-time and videotape-based evaluations of resident performance of both technical and nontechnical skills during an HFMS septic shock scenario provided equally reliable methods of assessment.

Society is demanding improved patient safety and improved clinical skills from junior physicians.1 Currently, most educational interventions in emergency medicine (EM) resident training are in a lecture-based format, but high-fidelity medical simulation (HFMS) is now being heavily integrated into graduate medical education and is becoming integral to the educational process.

Standardized patient examinations are a common method of evaluation in medical school training. The opportunity for assessment of interpersonal skills, clinician–patient interaction, and team building is present, but these exchanges generally involve low-risk, common patient interactions.2 HFMS-based evaluation is often specialty specific, procedure specific, or focused on uncommon or high-risk patient care scenarios. HFMS scenario focus can be on team interaction or patient safety, but in these scenarios, fidelity is often sacrificed.2

The Accreditation Council for Graduate Medical Education Outcome Project has shifted the focus of performance assessment from technical, scorecard-driven assessment to competency-based assessment, to involve skills not tested by traditional methods.3 Incorporation of resident performance on HFMS into the core competencies of EM has been proposed.2 Residency programs have been charged with developing tools by which these competencies may be assessed. HFMS has become a primary tool for this type of assessment.

EM and off-service residents rotating in the emergency department are infrequently assessed on their comprehension and implementation of clinical guidelines in a standardized manner due to patient care circumstances and high patient volume.4 HFMS may offer such a means of assessing the competency of residents in managing patients with complex illnesses such as severe sepsis and septic shock.5–10

Performance assessment has been used primarily for technical skills. Interrater reliability has previously been demonstrated to be high on such technical skills scorecards.11–14 Clinical skills are multifaceted, with both technical and nontechnical components necessary for modern practice. Nontechnical skills, such as cognitive reasoning, team building, and leadership, are often untested in traditional evaluation methods such as oral or written examinations and tend to be subjectively evaluated, when evaluated at all.2 Subjective evaluations can frequently produce poor interrater reliability.

HFMS provides one method of potentially assessing these skills in a more objective manner, although this has only been evaluated previously in a relatively limited manner. Research in the setting of EM has demonstrated the efficacy of HFMS in improving teamwork and likely patient safety.3,15 A nontechnical scorecard, previously utilized by Ottestad et al.,16 was adapted to assess leadership, interpersonal interactions, and team building. High interrater reliability was demonstrated in that study, as well as in another similar nontechnical performance assessment via HFMS.17

Most HFMS sessions are currently performed at our institution with real-time assessment of technical and nontechnical skills. The need to manage the scenario on the simulator, as well as the need to run the simulator concurrent with evaluation, potentially confounds the evaluation process. Both video-based evaluation and real-time evaluation have been used successfully in an examinee-centered approach,2,11,12,18–21 but the two methods of evaluation have not been directly compared within the same scenario to determine reliability of assessment. We sought to determine the intermethod reliability of evaluation by real-time versus videotape review. Our hypothesis was that real-time assessment and videotaped evaluation scorecards of a septic shock HFMS scenario on both technical and nontechnical skills would demonstrate high levels of agreement on Bland-Altman plot analysis.

Methods

Study Design

This trial was conducted as a prospective, single-blinded, cohort, crossover educational study assessing the intermethod reliability of an HFMS scenario of septic shock. Approval for this study was provided by the facility institutional review board, and informed consent was obtained from all participants.

Study Setting and Population

The HFMS scenarios were conducted in a dedicated simulation laboratory on the METI Human Patient Simulator (HPS), software Version 4 (Medical Education Technologies, Inc., Sarasota, FL). Capture of video images was performed using a Microsoft NX-6000 Web camera with associated software (Microsoft, Inc., Redmond, WA), on a Hewlett-Packard dv1000 (Palo Alto, CA). Playback of video capture was performed on Microsoft Windows Media Player.

Participants were physicians in training in the San Antonio Uniformed Services Health Education Consortium (SAUSHEC) Emergency Medicine Residency Program and SAUSHEC Transitional Year Program who volunteered for the study. The sampling method was nonrandomized and consecutive and included a total of 26 participants and four evaluators. Six of the 26 participants were third-year EM residents, 11 were second-year EM residents, and 8 were first-year EM residents. One resident was a first-year transitional resident.

The expert evaluators were four faculty members of the Department of Emergency Medicine at the San Antonio Military Medical Center. All were faculty in the SAUSHEC EM residency program and were intimately familiar with the 2008 Surviving Sepsis Guidelines. An in-depth training session on simulator function, the simulation scenario, the evaluation method, and grading of the technical and nontechnical scorecards was completed before evaluation began. All evaluators had participatory and evaluation experience with simulation. Evaluators were blinded to each other’s scoring, but not to the goals of the study.

Study Protocol

Residents were provided an HFMS scenario simulating the care of a patient suffering from septic shock. Residents were expected to provide care according to the principles of the Surviving Sepsis Guidelines,22 as they had been instructed in lecture-based format 4 months prior to the study, as well as in regular clinical practice. Each resident completed the simulation session only once. All resident simulation sessions were completed within 3 weeks.

Evaluators were not responsible for running the HPS; this was performed by a trained technician familiar with the simulation scenario and expected management course, to improve scoring accuracy and minimize evaluator distraction. Video equipment was managed by a separate investigator who was not responsible for simulation evaluation.

Evaluators were required to complete real-time scorecards within 5 minutes of the conclusion of the simulation. Videotaped review assessments did not have an explicit time limit on review, and repeated viewing of simulation sessions was allowed.

Each of the four evaluators scored half of the resident performances in real time and the other half from a video recording. Each evaluator was randomly assigned to the method he or she used to evaluate the first 13 residents and then crossed over to the other method for the remaining 13 residents. Each evaluator performed only one assessment on each resident, resulting in a total of 26 observations for each evaluator. At the end of the trial, residents were provided with their video and the feedback from the scorecards. Performance on this scenario was not counted toward overall assessment of residency performance. Resident performance on the scenario was not released or discussed with the residency leadership in any form by any of the investigators.

Measurements

Evaluators scored residents’ technical and nontechnical skills using separate scorecards (Tables 1 and 2).16 A technical scorecard for assessing objective measures of patient care was derived from the recommendations of the 2008 Surviving Sepsis Guidelines.22 These evidence-based guidelines, supported by 11 major national and international specialty societies, provided a framework for our HFMS protocol and assessment templates.

Table 1. 
Sepsis Simulation Performance Checklist (Technical Scorecard)
Tasks (26 total)

*Critical action; must perform to pass.

CBC = complete blood count; CVP = central venous pressure; EGDT = early goal-directed therapy; HCT = hematocrit; MAP = mean arterial pressure; PRBCs = packed red blood cells; ScVO2 = central venous oxygen saturation.
Identifies severe sepsis or septic shock*
Obtains measurement of serum lactate*
Obtains measurement of CBC
Obtains measurement of Chem 7
Obtains measurement of coagulation profile
Obtains measurement of random serum cortisol
Obtains blood and urine cultures before antibiotic administration*
Administers broad spectrum antibiotics in timely manner*
Places Foley catheter
Initiates peripheral intravenous access
Administers at least 500 mL normal saline bolus*
Recognizes lack of response to crystalloid administration
Initiates vasopressors
Verbalizes need for arterial pressure monitoring
Verbalizes desire to obtain MAP of at least 65 mm Hg
Verbalizes need for central venous access/presep catheter
Verbalizes need to measure central venous pressure
Verbalizes desire to obtain CVP of >8 mm Hg
Verbalizes desire to obtain ScVO2 > 70%
Administers PRBCs for HCT < 30% in setting of ScVO2 < 70%
Administers dobutamine for ScVO2 < 70% in setting of HCT > 30%
Considers endotracheal intubation, sedation, and paralysis if patient condition not responding to EGDT
Administers stress dose corticosteroids in setting of septic shock
Considers utilizing Xigris in setting of APACHE score of >25
Considers monitoring glucose to maintain 100–150 mg/dL
Considers lung protective ventilator strategy (6 mL/kg predicted body weight tidal volume)
Table 2. 
Nontechnical Scorecard
Item and rating anchors (behavioral descriptions are provided for scores of 1, 3, and 5 on the five-point scale)

  1. Ottestad E, Boulet JR, Lighthall GK. Evaluating the management of septic shock using patient simulation. Crit Care Med. 2007; 35:769–75. Copyright 2007 Wolters Kluwer Health, used with permission.

Anticipation and planning
  1: Plausible and likely problem occurs but takes team by surprise; unable to connect complication to primary diagnosis
  3: Able to anticipate some events; somewhat able to plan ahead
  5: Able to prioritize; anticipates complications; audible evidence of plan so whole team knows plan

Communication (words leading to action)
  1: Requests not acted on; barrage of orders that do not get completed; no follow-up on orders
  3: Vague requests to the room that get done slowly; poor follow-up
  5: Specific/direct requests; closed loop (“I will do that”); makes sure it is done

Leadership
  1: No leader, conflicts present, group alienated; unfocused effort; no delegation or direction
  3: Poor leader; no name recognition; easily distracted, task occupied
  5: Strong leader identifiable, calm; reevaluates problems and progress; stands back for big picture

Information transfer
  1: Underlying problems not noted; does not sign out primary problem, focuses on secondary problem (e.g., AMI)
  3: Brief and focused HPI; some problems noted and discussed; some mention of primary problem
  5: Signover reflects true reality and urgency of patient’s problems; includes past treatments and current plan

Task distribution
  1: Overloaded personnel, people not being used; inappropriate task pairing; tasks not completed (no one looking at monitor, chart not reviewed)
  3: Most people have tasks; most tasks are being completed
  5: Appropriate delegation, everyone has a task according to ability; makes sure all are comfortable doing chosen task

Communication content
  1: No communication about priorities/problems; wasting team attention with unimportant information
  3: Communication present but does not focus team on task
  5: Whole team knows plan and is able to prioritize accordingly

Information use
  1: Information available but ignored; misses watching monitor; does not obtain help/aids for gaps in knowledge
  3: Some information used in management; calls for help/consults
  5: Able to grasp all input streams; attention to all monitors/vitals; fills gaps in knowledge with help

Cognitive tasks, which required a resident to consider an action, were scored on the technical scorecard as completed if either task completion or rejection was vocalized. Potential scores on the technical scorecard ranged from 0 to 26 points (Table 1). Actions were weighted equally throughout the assessment. On the nontechnical scorecard, with seven items each rated 1 to 5, potential scores ranged from 7 to 35 points (Table 2).
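To make the scoring arithmetic concrete, the short sketch below tallies one hypothetical set of scorecard results in Python. The item names, ratings, and code are illustrative only; the study used paper scorecards, not software, and the sketch simply mirrors the scoring rules described above.

# Illustrative scoring sketch with hypothetical item names and ratings.

# Technical scorecard: 26 yes/no items, 1 point each. A cognitive ("considers ...")
# task counts as completed if the resident vocalized either doing it or rejecting it.
technical_items = {
    "identifies_severe_sepsis_or_septic_shock": True,
    "obtains_serum_lactate": True,
    "considers_lung_protective_ventilation": False,
    # ... the remaining 23 items are omitted here for brevity
}
technical_score = sum(technical_items.values())        # possible range 0-26

# Nontechnical scorecard: 7 items, each rated 1-5 on a Likert scale.
nontechnical_items = {
    "anticipation_and_planning": 4,
    "communication_words_to_action": 3,
    "leadership": 5,
    "information_transfer": 4,
    "task_distribution": 3,
    "communication_content": 4,
    "information_use": 4,
}
nontechnical_score = sum(nontechnical_items.values())  # possible range 7-35

print(technical_score, nontechnical_score)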

Data Analysis

SPSS Sample Power, Version 2.0 (SPSS, Inc., Chicago, IL), was used to estimate the sample size needed for a power of 80% with a level of confidence of 95%. As the different methods are tantamount to treatment groups, an effect size is an appropriate means of estimating sample size and power. Bland-Altman plot analysis was utilized to assess agreement between scores on the real-time and videotaped evaluations for both the technical and nontechnical scorecards.23 This was performed using MedCalc for Windows, Version 9.6.0.0 (MedCalc Software, Mariakerke, Belgium). Interrater reliability was then assessed using the intraclass correlation coefficient (ICC). Cronbach’s alpha was used to measure consistency among items on the scored scorecards (SPSS Version 16.0). Analysis of variance (ANOVA) was performed to determine if there was a difference between evaluators’ scores on the technical and nontechnical scorecards (SPSS Version 16.0). Pearson’s correlation was utilized to calculate a kappa value for agreement on the technical scorecard items requiring vocalization of cognitive performance (the fifth-from-last item and the last three items on the technical scorecard; SPSS Version 16.0).
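For readers who wish to reproduce these agreement statistics on their own data, the sketch below re-implements them in Python with NumPy using hypothetical scores. This is not the study’s analysis code (the analysis was performed in MedCalc and SPSS, as noted above), and the ICC shown is one common variant, a single-measure consistency ICC; the exact ICC model used in SPSS is not restated here.

# Illustrative re-implementation of the agreement statistics described above,
# applied to hypothetical data.
import numpy as np

def bland_altman(a, b):
    # Bias (mean difference) and 95% limits of agreement between two methods.
    diff = np.asarray(a, float) - np.asarray(b, float)
    bias = diff.mean()
    half_width = 1.96 * diff.std(ddof=1)
    return bias, (bias - half_width, bias + half_width)

def cronbach_alpha(scores):
    # scores: 2-D array, rows = scenarios, columns = items (or raters/methods).
    x = np.asarray(scores, float)
    k = x.shape[1]
    return k / (k - 1) * (1 - x.var(axis=0, ddof=1).sum() / x.sum(axis=1).var(ddof=1))

def icc_consistency(ratings):
    # Single-measure consistency ICC, ICC(3,1), from a two-way ANOVA decomposition.
    # ratings: 2-D array, rows = subjects (scenarios), columns = raters/methods.
    y = np.asarray(ratings, float)
    n, k = y.shape
    grand = y.mean()
    ss_rows = k * ((y.mean(axis=1) - grand) ** 2).sum()
    ss_cols = n * ((y.mean(axis=0) - grand) ** 2).sum()
    ss_err = ((y - grand) ** 2).sum() - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)

# Hypothetical technical-scorecard totals for five scenarios scored by both methods.
realtime = [20, 18, 23, 15, 21]
videotape = [21, 18, 22, 16, 21]
paired = np.column_stack([realtime, videotape])

bias, limits = bland_altman(realtime, videotape)
print(f"bias {bias:+.1f}, limits of agreement {limits[0]:.1f} to {limits[1]:.1f}")
print("Cronbach's alpha:", round(cronbach_alpha(paired), 3))
print("ICC(3,1):", round(icc_consistency(paired), 3))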

Results

Blinding of evaluators to other raters’ scores was successful. A total of 13 observations per evaluator for each evaluation method (26 observations/evaluator total) were compared. Bland-Altman plot analysis of both evaluation methods revealed substantial agreement between the real-time and videotaped review scores by reviewers. The mean difference between the reviewers was 0.0 (95% confidence interval [CI] = –3.7 to 3.6) on the technical evaluation (Figure 1) and –1.6 (95% CI = –11.4 to 8.2) on the nontechnical scorecard assessment (Figure 2). Comparison of videotape evaluations for the technical scorecard demonstrated a Cronbach’s alpha of 0.914, with an ICC of 0.842 (95% CI = 0.679 to 0.926), while the real-time technical scorecard demonstrated a Cronbach’s alpha of 0.899, with an ICC of 0.817 (95% CI = 0.633 to 0.914), demonstrating excellent intermethod reliability. Comparison of evaluations for the videotape nontechnical scorecard demonstrated a Cronbach’s alpha of 0.888, with an ICC of 0.798 (95% CI = 0.600 to 0.904), while the real-time nontechnical scorecard demonstrated a Cronbach’s alpha of 0.833, with an ICC of 0.714 (95% CI = 0.457 to 0.861), demonstrating substantial interrater reliability. The raters were consistent in agreement on performance within each level of training, as the ANOVA demonstrated no significant differences between evaluators on the technical scorecard (p = 0.176) or the nontechnical scorecard (p = 0.367). No significant clustering of scores by evaluator was noted for the technical scorecard (Figure 3) or the nontechnical scorecard (Figure 4). There were no evaluator pairing effects, as the evaluators were randomly assigned to real-time assessment or videotape review. For the first three cognitive tasks on the technical scorecard, there was complete agreement that the residents did not perform the task. Item 26 demonstrated poor agreement, with a kappa value of 0.283.

Figure 1.

 Bland-Altman plot of intermethod agreement between real-time and videotaped evaluation of the technical checklist, demonstrating no difference in scoring (mean difference = 0.0).

Figure 2.

 Bland-Altman plot of intermethod agreement between real-time and videotaped evaluation of the nontechnical checklist, demonstrating a nonsignificant difference in scoring (mean difference = –1.6 points).

Figure 3.

 Dot plot of technical checklist scores demonstrating a lack of clustering effect.

Figure 4.

 Dot plot of nontechnical checklist scores demonstrating a lack of clustering effect.

Discussion

As HFMS achieves increasing prominence as a method of instruction and evaluation in undergraduate and graduate medical education, the methods by which summative evaluation is performed become increasingly important. We found high levels of agreement between real-time and videotape-based evaluation scorecards of a septic shock HFMS scenario for both technical and nontechnical skills.

As the technical scorecard is very action-driven and relatively objective, there was a smaller difference in technical scores between evaluation methods than with nontechnical scores. The nontechnical scorecard is less action-driven, and thus it may be more susceptible to scoring variations and to the effect of the repetitive viewing allowed in the videotape review method.

Of the four technical scorecard items considered cognitive tasks, only the final cognitive item on the technical scorecard (“Considers lung protective ventilator strategy (6 mL/kg predicted body weight tidal volume)”) demonstrated a poor kappa value, suggesting that this item was difficult to interpret in the evaluation scenario, across methods of evaluation.

Performance assessment in simulation is difficult, especially of relatively subjective nontechnical factors. Expert consensus is the most frequently used method for setting competency standards. Test-centered methods incorporate expert consensus to determine what items and cut score are needed to pass the examination. Our technical and nontechnical scorecards are examples of a test-centered, expert consensus assessment. Using the standard scorecards, we were able to determine the similarity of scoring across two methods of assessment. Our goal in this effort was not to validate a scorecard or scenario, but to determine if the scoring on the scorecard was similar across methods of evaluation. Thus, determination of a cut point was not essential to our efforts.

In an attempt to standardize the nontechnical portion of the evaluation, the scorecard was adapted from Ottestad et al.,16 with explicit permission as displayed in Table 2. As this template had been used previously with reproducible results, we felt that in the absence of a validated protocol and evaluation scorecard for the skills that we sought to test, this was an acceptable alternative.

Our goal was not to explicitly determine a preferential method of evaluation, but to compare the reliability of the two methods of evaluation. We were unable to find a striking difference, suggesting that the choice of evaluation should be based upon factors other than reliability of assessment.

The opportunity to provide formative feedback via video, with advanced digital technology, has become an option. While cost-prohibitive for most programs, this technology allows the evaluator to mark specific examples of behavior within the evaluation scenario to critique. A video recording of a simulation session provides the learner with a “third-person view” of his or her behavior and an opportunity for self-critique. While this study did not address the use of this technology, the added cost, technical expertise, and time required for this method of evaluation and critique may be well worth it.

Given the increased complexity, time, effort, and technical expertise required to perform videotaped review and evaluation of students, it may not be worth the effort unless other considerations are present, such as the absence of an evaluator at the time of simulation, multiple evaluation sessions running concurrently, or the need for a permanent record.

Limitations

Limitations include the lack of a prevalidated simulation scenario and evaluation scorecard and the necessity for subjective interpretation of nontechnical skills. Without a prevalidated HFMS scenario, the reliability of the data and conclusions outside of this single scenario may be limited, as the scorecard items and evaluation item wording and composition were specific to this scenario.

The resident subjects involved in this study were all volunteers, perhaps resulting in selection bias. This population of residents may theoretically differ from the residency population as a whole, with residents who have more confidence in their skills or more interest in simulation being more likely to participate. However, as the residents were evaluated on a common scenario by four separate examiners, resident eagerness, confidence, or even performance should have had little effect on the comparison of the two evaluation methods.

Subjective interpretation of some of the items on the nontechnical scorecard (i.e., some of those cognitive and teamwork tasks) may reduce the overall interrater reliability of the evaluation of a simulation scenario. Further study is needed to identify which specific tasks may result in unacceptable interrater reliability.

Conclusions

The intermethod reliability of real-time versus videotaped review evaluations of resident performance on a high-fidelity medical simulation septic shock scenario demonstrated substantial agreement among evaluators on both the technical and the nontechnical scorecards. In situations where cost, time, or technical expertise are limiting factors, real-time assessment of performance may be feasible, as it demonstrated minimal difference in scoring between the real-time and videotaped performance in our trial. Videotaping learner performance on HFMS scenarios, while not demonstrating a difference in scoring from real-time evaluation, offers real benefits, including the possibility of performance review and formative feedback, which may greatly improve the utility of HFMS in the educational and evaluation setting.

We acknowledge the following for their expertise and patience regarding this project: Dr. David Stamper, Director, Medical Simulation Center, Brooke Army Medical Center; Mr. Robert Jones, Coordinator, Medical Simulation Center, San Antonio Military Medical Center–North; and Ms. Mida Gonzalez, Education Technician, Medical Simulation Center, San Antonio Military Medical Center–North.
