Summative Assessment in Medicine: The Promise of Simulation for High-stakes Evaluation


  • Presented at the 2008 Academic Emergency Medicine Consensus Conference, “The Science of Simulation in Healthcare: Defining and Developing Clinical Expertise,” Washington, DC, May 28, 2008.

Address for correspondence and reprints: John R. Boulet, PhD; e-mail:


Throughout their careers, physicians are exposed to a wide array of assessments, including those aimed at evaluating knowledge, clinical skills, and clinical decision-making. While many of these assessments are used as part of formative evaluation activities, others are employed to establish competence and, as a byproduct, to promote patient safety. In the past 10 years, simulations have been successfully incorporated in a number of high-stakes physician certification and licensure exams. In developing these simulation-based assessments, testing organizations were able to promote novel test administration protocols, build enhanced assessment rubrics, advance sophisticated scoring and equating algorithms, and promote innovative standard-setting methods. Moreover, numerous studies have been conducted to identify potential threats to the validity of test score interpretations. As simulation technology expands and new simulators are invented, this groundbreaking work can serve as a basis for organizations to build or expand their summative assessment activities. Although there will continue to be logistical and psychometric problems, many of which will be specialty- or simulator-specific, past experience with performance-based assessments suggests that most challenges can be addressed through focused research. Simulation, whether it involves standardized patients (SPs), computerized case management scenarios, part-task trainers, electromechanical mannequins, or a combination of these methods, holds great promise for high-stakes assessment.

The use of assessments in medicine has a long history, dating back several decades.1,2 Their utility, both for evaluating competencies and for identifying curricular deficiencies, is well documented.3 With new developments in technology, significant efforts have been directed at delineating the role of various simulation modalities in instructional activities, many of which include a student-, resident-, or practitioner-assessment component.4–8 Recently, based primarily on concerns related to physician competency and patient safety, various summative assessments, including those specifically targeting performance domains, have been incorporated into the process used to license and certify physicians. In contrast to formative assessments, including those that employ high-fidelity patient simulators,9,10 where the primary goal is to provide feedback to the individual concerning strengths and weaknesses, summative evaluation activities are meant to determine some end point status (e.g., competent/not competent, ready to practice independently). Appropriately, these types of assessments, in addition to focusing on the evaluation of knowledge and clinical reasoning, have targeted other important competencies, such as patient care, and interpersonal and communication skills.11 In building these assessments, it has been necessary to embrace various simulation modalities, including computer-based modeling of clinical environments, standardized patients (SPs), task trainers, and various hybrid models.12,13

Although simulations, in a broad context, are used as part of specialty board certification14,15 and have recently been introduced as part of maintenance of certification activities,16 their primary use in the United States, at least based on volume, has been for the certification and licensure of physicians.17,18 Starting in 1999, the United States Medical Licensing Examination (USMLE™) introduced computer-based case simulations as part of the Step 3 assessment.19 Step 3, which consists of multiple-choice items and computer-based case simulations, is designed to assess whether a physician can apply the medical knowledge and understanding of biomedical and clinical science essential for the unsupervised practice of medicine. In 2004, the USMLE introduced the Step 2 Clinical Skills (CS) examination.20 Before this, the Educational Commission for Foreign Medical Graduates (ECFMG) had developed and administered the Clinical Skills Assessment, a multistation SP-based evaluation designed to assess the readiness of international medical graduates to enter graduate medical education (GME) programs in the United States.21 Similarly, the National Board of Osteopathic Medical Examiners (NBOME) introduced the Comprehensive Osteopathic Medical Licensing Examination (COMLEX) Level 2 Performance Evaluation in 2004.22 These multistation SP assessments all use, or used, simulated clinical encounters to assess physicians’ clinical skills, including their ability to perform history taking, physical examination, and communication, both oral (with the patient) and written (with the health care team). Taken together, the introduction of simulation-based certification and licensure examinations has stimulated many changes in medical training programs, including the introduction of a host of curricular changes, a new-found emphasis on the importance of communication skills, and arguably, more effective learning.23–26

In developing high-stakes summative simulation-based assessments, much has been learned about exam design, test administration and logistics, quality assurance, and psychometrics.27,28 With respect to exam design, efforts have been made to model simulation scenarios to specifically measure certain skills and to do this in a realistic way by choosing simulated patient complaints based on actual practice.29 Sophisticated computer programs have been developed to schedule examinees and to monitor their performance throughout the examination. Based on the need to test a large number of examinees (e.g., >30,000/year for the USMLE Step 2 CS), standard procedures were created to develop and test simulation scenarios, recruit and hire SPs, and set up and staff examination centers. Since these summative examinations have high-stakes consequences (i.e., examinees must eventually pass to be eligible for a license to practice medicine), quality assurance is paramount.30 The ability to accurately separate those who are competent from those who are not is critical. The accuracy of these types of decisions, which depends on the psychometric properties of the measures,31,32 has been studied in detail. As testing practices evolve, and new simulation modalities emerge, they must be similarly scrutinized with respect to the reliability of the scores (or decisions made based on those scores), the validity of the inferences one can make based on the scores, and their overall fairness.

It is clear that simulation-based methods have shown great promise as summative evaluation tools. Nevertheless, as the practice domain expands, there will continue to be many challenges. The purpose of this article is to provide an overview of some of the obstacles that would need to be overcome if summative, simulation-based assessments are to be introduced or augmented in other areas and other disciplines such as emergency medicine (EM). Four general areas will be highlighted: 1) defining the skills and choosing the appropriate simulation tasks, 2) developing appropriate metrics, 3) assessing the reliability of test scores, and 4) providing evidence to support the validity of test score inferences. The discussion will center on psychometric issues and not those associated with test administration logistics (e.g., candidate scheduling) or physical test site specifications. Fortunately, many psychometric and logistical obstacles have already been addressed, albeit not perfectly, in currently employed summative assessments, including those used for certification and licensure.20,27 Thus, further expansion of simulation-based summative assessments into other areas (e.g., specialty board examinations, continuing medical education activities) may not be as daunting as first thought.

The Promise of Simulation-Based Assessment: Some Key Issues

Defining the Skills and Choosing the Appropriate Simulation Tasks

Although much has been written about the design of simulation-based educational programs33,34 and the development of mannequin simulators for clinical education and training,8,35 the construction of quality assessments can still be an onerous task. Test developers must keep in mind the intended purpose of the test, the knowledge and skills to be evaluated and, for performance-based activities, the specific context for, and design of, the particular exercises.36 Ideally, assessment activities should be targeted at the ability level of the examinee.37 When the purpose of the assessment is not clear, the chosen tasks often yield poor, or inappropriate, measures of ability. As an example, a simulation-based assessment designed for selecting residents into a GME program could be quite different, in terms of both tasks and task difficulty, from one that was constructed for board certification needs. With respect to the knowledge and skills to be evaluated, this is usually guided by curricular information, competency guidelines,38–40 and the technical limitations of the chosen simulators.41 Once these evaluation content issues have been identified and synthesized, the test developer must specify the simulation parameters. Most important among these is choosing the particular scenarios that offer the best opportunity to sample the knowledge and skills one wishes to measure. For SP-based assessments, this process has been aided by accessing national health care databases and modeling scenarios based on common reasons for visiting the doctor.42 For EM or other acute care disciplines, one could also utilize available data resources such as the National Hospital Ambulatory Medical Care Survey (NHAMCS),43 but this may unnecessarily restrict the assessment domain, especially if nonprocedural skills are being evaluated. Often, rare, reasonably complex events will provide the best opportunity to assess specific skill sets, such as clinical decision-making and communication. 
As a result, special care must be taken to choose the appropriate simulation scenarios, ones where the intended knowledge and skills can be best measured. An effective strategy for existing performance-based certification and licensure examinations has been to utilize both health care data resources and expert judgment.

With the rapid development of simulator technology, including full-body mannequins and part-task trainers, the assessment domain has greatly expanded, in terms of both skills being measured and tasks that can be modeled.44–46 With SPs, or even “real” patients, it is often difficult, if not impossible, to measure procedural and management skills. Mannequins and part-task trainers can be used to measure specific therapeutic skills (e.g., airway management, venipuncture techniques, administering drugs) and, in combination with other simulation modalities such as SPs, abilities related to resource management, professionalism, and teamwork.12,13,47 Similar to the expansion of knowledge-based item testing formats, the introduction of new simulation modalities provides an opportunity to measure various skills in different and more realistic ways, a change that should eventually yield more robust and defensible assessments.

While the introduction of new simulation modalities will expand the assessment domain, there are some limitations with current technologies, many of which have been acknowledged in the literature.48 First, even with the most sophisticated high-fidelity mannequins, some presentations and patient conditions cannot be modeled very well (e.g., sweating, changes in skin color, response to painful stimuli). As a result, there will still be a need to incorporate assessment activities involving direct patient contact. Second, for electromechanical mannequins, the interrelationships between different physiologic variables may be imperfect, especially when combined interventions are attempted. If the simulator responds unpredictably to a given intervention (e.g., coadministration of an induction agent and an analgesic), it will be difficult to have any confidence in the assessment. Moreover, to the extent that those being assessed are continually cued by changes in monitored output, improperly scripted or modeled scenarios will provide a poor milieu for evaluation. Those charged with developing simulation-based assessments must balance the need to measure specific abilities against the technologic limitations of the simulator(s), recognizing that many conditions cannot be simulated very well, costs can be prohibitive, and stakeholder buy-in is essential.49,50

Developing Appropriate Metrics

If simulation-based assessments are to be used for summative assessment activities, it is essential that appropriate metrics be employed. Developing rubrics is certainly one of the main assessment challenges. Although much has been learned from the development of performance-based assessments of clinical skills,51 adapting that experience to the types of simulations appropriate for EM or other acute care disciplines is not without problems. With this in mind, efforts to develop scoring metrics for high-fidelity simulators are currently expanding.52–57

Based on prevailing SP literature, and more specifically that related to licensure and certification examinations, both analytic and holistic (or global) scoring metrics have been employed.58 The prevailing methodology for analytic scores involves the use of checklists. For a typical clinical skills simulation scenario, checklists can be constructed to tap explicit processes and measure domains such as history taking and physical examination. Here, committees are employed to determine specific checklist content, often based on evidence-based criteria.52,54,59,60 For example, given the patient’s (simulator’s) presenting complaint(s), checklist items may include the history taking questions that should be asked, the physical examination maneuvers that should be performed, and the management strategies that should be employed. Depending on the nature of the clinical simulation, some items (actions) can be weighted more heavily than others. Although checklists have worked reasonably well and have provided modestly reproducible scores depending on the number of simulated scenarios, they have been criticized for a number of reasons. First, checklists, while objective in terms of scoring, can be subjective in terms of construction.61 While specific practice guidelines may exist for some conditions, there can still be considerable debate as to which actions are important or necessary, given the patient’s complaint or reason for visiting the physician. Without expert consensus, one could question the validity of the scenario scores. Second, the use of checklists, if known by those taking the assessment, may promote rote behaviors such as employing rapid-fire questioning techniques. To accrue more points, examinees may ask as many questions as they can and/or perform as many physical examination maneuvers as are possible within the allotted time frame. Once again, this could call into question the validity of the scores. 
Third, and likely most relevant for acute care simulations, checklists are not conducive to recording/scoring the timing or sequencing of tasks, especially if the simulation scenario requires a series of patient management activities. In EM, one can envision many scenarios where the order and timing of physician actions are critical. Although checklist-based timing has been employed in some evaluations,62,63 the order of actions, at least for analytic-based scoring, is often ignored completely.
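The distinction between weighted checklist scoring and the sequencing of actions can be made concrete with a short sketch. The Python below is illustrative only: the item names, weights, and the timestamped action log are hypothetical and are not drawn from any operational examination. It computes a weighted percent score and then, separately, checks whether a required order of actions was respected, which a conventional checklist score does not capture.

```python
from dataclasses import dataclass


@dataclass
class ChecklistItem:
    name: str
    weight: float  # hypothetical importance weight set by an expert committee


def weighted_checklist_score(items, performed):
    """Return the percentage of available weighted points that were earned."""
    total = sum(item.weight for item in items)
    earned = sum(item.weight for item in items if item.name in performed)
    return 100.0 * earned / total


def order_respected(action_log, required_order):
    """Check that the required actions first occur in the given relative order.

    action_log is a list of (seconds_from_start, action_name) tuples.
    A required action that never occurs makes the check fail.
    """
    first_time = {}
    for t, name in action_log:
        # Keep the earliest time at which each action was observed.
        first_time[name] = min(t, first_time.get(name, t))
    times = [first_time.get(name) for name in required_order]
    if any(t is None for t in times):
        return False
    return all(a <= b for a, b in zip(times, times[1:]))


# Hypothetical acute care scenario: items and weights are assumptions.
items = [
    ChecklistItem("assess_airway", 3.0),
    ChecklistItem("give_oxygen", 2.0),
    ChecklistItem("obtain_iv_access", 2.0),
    ChecklistItem("ask_allergies", 1.0),
]
log = [(5, "assess_airway"), (20, "give_oxygen"), (45, "ask_allergies")]

performed = {name for _, name in log}
score = weighted_checklist_score(items, performed)
in_order = order_respected(log, ["assess_airway", "give_oxygen"])
print(f"weighted score: {score:.1f}%, airway before oxygen: {in_order}")
```

Two examinees who perform the same actions in a different order would receive identical checklist scores here; only the separate order check distinguishes them, which is the limitation noted above.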

Holistic scoring, where the entire performance is rated as a whole, can also be employed in summative simulation-based assessments. Although there is often considerable reluctance to employ rating scales, they can effectively measure certain constructs, especially those that are complex and multidimensional, such as communication and teamwork.64,65 Unfortunately, throughout medicine there seems to be a preference to use “less subjective” measures, such as checklists and key actions, even though psychometric properties of global rating scales are often adequate.55,66 Avoiding rating scales, however, may not be prudent, since rating scales can be constructed to evaluate implicit processes. Rating scales allow raters to take into account egregious actions and/or unnecessary patient management strategies, something that is difficult to do with checklists.67 Although two raters watching the same encounter may not produce the exact same score, or scores, it is often possible to minimize this source of error. In addition, where systematic differences in rater stringency exist, score-equating strategies can be employed.68 Therefore, as highlighted later in this article, and depending on what skill is being evaluated, one may prefer to sacrifice some measurement precision to achieve greater score validity.
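To illustrate what an adjustment for rater stringency might look like, the sketch below removes each rater's average deviation from the grand mean. This is a deliberately crude mean adjustment and not the equating method referenced above or one used by any testing organization. It assumes examinees are assigned to raters essentially at random, so that differences in rater means reflect stringency rather than examinee ability; the rater labels and scores are hypothetical.

```python
from statistics import mean


def stringency_adjusted(ratings):
    """Subtract each rater's mean deviation from the grand mean.

    ratings is a list of (rater_id, examinee_id, score) tuples.
    Assumes random assignment of examinees to raters (an assumption of
    this sketch), so a rater's mean offset is attributed to stringency.
    """
    grand = mean(score for _, _, score in ratings)
    by_rater = {}
    for rater, _, score in ratings:
        by_rater.setdefault(rater, []).append(score)
    offset = {rater: mean(scores) - grand for rater, scores in by_rater.items()}
    return [(rater, examinee, score - offset[rater])
            for rater, examinee, score in ratings]


# Hypothetical global ratings on a 1-9 scale; rater "A" is more lenient.
ratings = [
    ("A", "e1", 6.0), ("A", "e2", 7.0), ("A", "e3", 8.0),
    ("B", "e4", 4.0), ("B", "e5", 5.0), ("B", "e6", 6.0),
]
adjusted = stringency_adjusted(ratings)
for rater, examinee, score in adjusted:
    print(rater, examinee, round(score, 2))
```

In practice, raters rarely see randomly equivalent groups, which is one reason operational programs rely on more formal equating designs.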

Often, when developing rating scales, evaluators concentrate solely on the rubric (i.e., specification of the constructs that are going to be measured, deciding the number of score points, benchmarking certain score categories), ignoring any rater training or quality assurance regimes. Although raters may be content experts, this does not necessarily qualify them to be evaluators, especially for high-stakes summative assessments. In most instances, regardless of their qualifications, evaluators need to be trained to use holistic rating scales. Most, if not all, simulation-based certification examinations have well-defined, written, training protocols. These protocols can include specific rater exercises (e.g., rating benchmarked taped performances), various quality assurance measures (e.g., double rating a sample of examinee performances), and periodic refresher training. By developing a meaningful rubric and assuring the provision of ratings that accurately reflect examinee abilities, it is possible to minimize bias and produce more valid performance measures.

The introduction of technologically sophisticated mannequin-based simulators, combined with an impetus to create physician-specific ability measures,69,70 offers the opportunity to develop new, perhaps more valid, assessment metrics. Many of the available electromechanical devices can generate machine-readable records of the physiologic responses of the mannequin. Provided that the mannequin responds realistically and reproducibly to any interventions (e.g., ventilation, drug therapy), and the timing of the actions can be demarked, then it should be possible to develop explicit performance measures that are based on patient (simulator) outcomes. However, developing these types of scoring metrics will require some work. For many acute care situations, there may be multiple, often equally appropriate, ways to manage a patient’s condition. Moreover, determining the relative importance of certain patient outcomes, especially for short simulation scenarios, may be quite difficult. Nevertheless, it seems appropriate, even within a simulated environment, to base the physician’s performance, at least to some extent, on how the patient (simulator) responds to various therapeutic interventions.
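As a rough sketch of how a machine-readable simulator record could feed an outcome-based measure, the Python below derives elapsed times from a timestamped event log. The log format and event names (for example, "spo2_below_90") are assumptions made for illustration; actual simulator exports differ by manufacturer and would need to be mapped to a structure like this.

```python
def time_to_event(events, trigger, target):
    """Seconds from the first `trigger` event to the first later `target` event.

    events is a list of (seconds_from_start, event_name) tuples.
    Returns None if the trigger or the subsequent target never occurs.
    """
    ordered = sorted(events)
    start = next((t for t, name in ordered if name == trigger), None)
    if start is None:
        return None
    end = next((t for t, name in ordered if name == target and t >= start), None)
    return None if end is None else end - start


# Hypothetical log combining examinee actions and simulated patient states.
events = [
    (0.0, "scenario_start"),
    (30.0, "spo2_below_90"),
    (75.0, "bag_mask_ventilation"),
    (140.0, "spo2_above_94"),
]

time_to_treatment = time_to_event(events, "spo2_below_90", "bag_mask_ventilation")
time_to_recovery = time_to_event(events, "spo2_below_90", "spo2_above_94")
print(time_to_treatment, time_to_recovery)
```

The first quantity is a process measure (how quickly the examinee acted) and the second is a simulated patient outcome; the latter is only as trustworthy as the physiologic model that produced it.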

Assessing the Reliability of Test Scores

For a simulation-based assessment to be employed for summative purposes, testing organizations need to be reasonably certain that the scores are reliable. Compared to typical multiple choice examinations, there are many more sources of measurement error in a typical simulation-based assessment, including those associated with the raters.32,71 Without acknowledging the individual and combined sources of error, one often gets an incomplete picture of reliability. For example, if checklists or key actions are used to generate scores for a simulation scenario, an internal consistency coefficient may be calculated.60 While this can be presented as a reliability measure for a multiscenario assessment of clinical skills, the consistency of examinees’ scores over encounters is a greater concern than consistency within each individual encounter. Typically, some measure of interrater reliability is also calculated.72,73 If two raters provide scores or ratings for a given station, one can simply correlate the two measures using various statistical techniques. While scoring consistency within a simulation encounter is certainly important, relying solely on a scenario-based measure of agreement is also incomplete. While two raters may be somewhat inconsistent in their scoring, this will not necessarily lead to an unreliable total assessment score, as long as there are a number of independently rated simulation tasks. To better understand the sources of measurement error in a multiscenario performance-based simulation assessment, generalizability (G) studies are often employed.74,75 These studies are conducted to specifically delimit the relative magnitude of various error sources and their associated interactions. Following the G-study, decision studies can be undertaken to determine the optimal scoring design (e.g., number of simulated encounters, number of raters per given encounter).
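The logic of a G-study followed by a decision study can be sketched for the simplest case: a fully crossed persons-by-tasks design with one score per cell. The Python below estimates variance components from the usual two-way ANOVA mean squares and then projects the generalizability coefficient for a different number of tasks. The score matrix is hypothetical, and operational analyses typically involve additional facets (e.g., raters) and specialized software.

```python
def g_study_p_by_t(scores):
    """Estimate variance components for a crossed persons x tasks design.

    scores[p][t] is the score of examinee p on task t (one score per cell).
    Negative estimates are truncated at zero, a common convention.
    """
    n_p, n_t = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_t)
    p_means = [sum(row) / n_t for row in scores]
    t_means = [sum(row[t] for row in scores) / n_p for t in range(n_t)]

    ss_p = n_t * sum((m - grand) ** 2 for m in p_means)
    ss_t = n_p * sum((m - grand) ** 2 for m in t_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_res = max(ss_total - ss_p - ss_t, 0.0)

    ms_p = ss_p / (n_p - 1)
    ms_t = ss_t / (n_t - 1)
    ms_res = ss_res / ((n_p - 1) * (n_t - 1))

    return {
        "person": max((ms_p - ms_res) / n_t, 0.0),
        "task": max((ms_t - ms_res) / n_p, 0.0),
        # Person-by-task interaction is confounded with residual error.
        "person_x_task_error": ms_res,
    }


def d_study(components, n_tasks):
    """Projected generalizability coefficient (relative decisions)."""
    error = components["person_x_task_error"] / n_tasks
    denom = components["person"] + error
    return components["person"] / denom if denom > 0 else 0.0


# Hypothetical scores: 4 examinees (rows) on 3 simulated scenarios (columns).
scores = [[7, 5, 8], [6, 6, 5], [9, 7, 8], [4, 5, 3]]
components = g_study_p_by_t(scores)
for n in (3, 6, 12):
    print(n, round(d_study(components, n), 3))
```

A large person-by-task component relative to the person component is the numerical signature of content specificity: examinees rank differently from one scenario to the next, so more scenarios are needed.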

Within the performance assessment literature, numerous studies have been conducted to estimate the impact of various designs on the reproducibility of the scores.76 Although raters have been identified as a source of variability, their impact on reliability, given proper training and well-specified rubrics, tends to be minimal, often being far outweighed by task sampling variance. Essentially, given the content specificity of certain simulation tasks, examinees may perform inconsistently from one simulation scenario to the next. As a result, if there are few tasks, the reliability of an assessment score can be poor. For example, if we are trying to assess patient management skills, fewer performance samples (simulated scenarios) will exacerbate the overall impact of content specificity, thus yielding reduced overall precision. In general, for these types of performance-based assessments, issues regarding inadequate score reliability can be best addressed by lengthening the assessment (i.e., increasing the number of simulated tasks) rather than increasing the number of raters per given encounter. To minimize any rater effects, it is usually most effective to employ as many different raters as possible for any given examinee (e.g., a different rater for every task).28
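The trade-off between adding tasks and adding raters can be illustrated with the standard decision-study formulas for a persons-by-tasks-by-raters design. In the sketch below, the variance components are hypothetical values chosen so that task sampling variance dominates rater variance, as described above; they are not estimates from any particular examination. The second function covers the case where a different set of raters scores each task (raters nested within tasks).

```python
def g_crossed(var_p, var_pt, var_pr, var_ptr_e, n_tasks, n_raters):
    """Generalizability coefficient when the same raters score every task."""
    error = (var_pt / n_tasks
             + var_pr / n_raters
             + var_ptr_e / (n_tasks * n_raters))
    return var_p / (var_p + error)


def g_nested(var_p, var_pt, var_pr, var_ptr_e, n_tasks, n_raters):
    """Coefficient when different raters score each task (raters within tasks).

    Rater-related error is then averaged over n_tasks * n_raters ratings.
    """
    error = var_pt / n_tasks + (var_pr + var_ptr_e) / (n_tasks * n_raters)
    return var_p / (var_p + error)


# Hypothetical variance components (assumptions for illustration only).
vc = dict(var_p=0.20, var_pt=0.50, var_pr=0.02, var_ptr_e=0.30)

designs = {
    "6 tasks, 1 rater (crossed)": g_crossed(**vc, n_tasks=6, n_raters=1),
    "6 tasks, 2 raters (crossed)": g_crossed(**vc, n_tasks=6, n_raters=2),
    "12 tasks, 1 rater (crossed)": g_crossed(**vc, n_tasks=12, n_raters=1),
    "6 tasks, 1 rater per task (nested)": g_nested(**vc, n_tasks=6, n_raters=1),
}
for label, value in designs.items():
    print(f"{label}: {value:.3f}")
```

Under these assumed components, doubling the number of tasks improves the coefficient more than doubling the number of raters, and using a different rater for each task does better than having one rater score all tasks.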

A summative simulation-based EM assessment would likely be similar in design to existing high-stakes clinical skills assessments. There would be multiple scenarios sampled from the practice domain, both content-specific (e.g., key action checklists) and task-invariant scales (e.g., communication, teamwork, professionalism), and standardized exam administration conditions. However, unlike the common clinical encounters modeled in the current SP-based licensure examinations, the simulated EM encounter would be expected to be even more task specific, at least in terms of patient management activities. If this is true, and one wants to measure skills related to patient management, then it could take many more encounters to achieve sufficient score reproducibility. Fortunately, many events in EM, including a large number that can be effectively modeled in a simulated environment, require fairly rapid interventions. Unlike typical SP-based cases, which usually last from 10 to 20 minutes, acute care scenarios can easily be modeled to take place in a 5-minute period. Since the simulation scenarios can be relatively short, it is possible to include more of them in a given assessment. Increasing the number of simulated encounters or tasks is probably the most efficient way to increase the reliability of the overall assessment.
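The practical advantage of short scenarios can be worked through with the Spearman-Brown formula, which projects reliability as a test is lengthened. The sketch below is illustrative: the single-scenario reliability of 0.15 and the target of 0.80 are assumed values, and the calculation further assumes that a 5-minute scenario yields about the same per-scenario reliability as a longer one, which would need to be verified empirically.

```python
import math


def spearman_brown(single_task_reliability, n_tasks):
    """Projected reliability of a score averaged over n_tasks parallel tasks."""
    r = single_task_reliability
    return n_tasks * r / (1 + (n_tasks - 1) * r)


def tasks_needed(single_task_reliability, target):
    """Smallest number of tasks whose projected reliability reaches target."""
    r = single_task_reliability
    exact = target * (1 - r) / (r * (1 - target))
    # Round first so floating-point noise does not push an exact integer up.
    return math.ceil(round(exact, 9))


single = 0.15   # assumed reliability of one scenario score
target = 0.80   # assumed reliability required for a high-stakes decision

n = tasks_needed(single, target)
print(f"scenarios needed: {n}")
print(f"projected reliability: {spearman_brown(single, n):.3f}")
print(f"testing time at 5 min each: {n * 5} min")
print(f"testing time at 15 min each: {n * 15} min")
```

The number of scenarios required is the same in both cases; what changes is whether that number fits within a feasible testing session.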

Providing Evidence to Support the Validity of Test Score Inferences

The validity of a simulation-based assessment relates to the inferences that we want to make based on the examination scores.77–79 Looking at the simulation literature in general, and the research related to high-stakes SP examinations in particular, there are numerous potential ways to assess the validity of test scores. One should note, however, that the validation process is never complete; additional evidence to support the intended test score inferences can always be gathered.

For performance-based assessments, most notably those used for certification and licensure, content-related issues have been emphasized.80–82 To support the content validity of the assessment, simulated scenarios have been modeled based on actual practice characteristics, including the types of patients that are normally seen in ambulatory settings. If acute care scenarios are being modeled, a similar strategy of referencing emergency care data can easily be employed. With respect to rubrics, special care is taken to define the specific skill sets and measures that, from an evidence-based perspective, adequately reflect them. Finally, the encounters are modeled in realistic ways, utilizing the same equipment that would be found in a real clinic. All of these strategies, in addition to positive feedback from stakeholders regarding the verisimilitude of the assessment activities,83 will help support the content validity of the test scores.

If a simulation-based assessment is designed to measure certain skills, then it is imperative that data be obtained to support this claim.40 This process is generally referred to as construct validation, and various strategies can be employed to gather suitable evidence. If several skills are being measured, then one could postulate relationships among them. For example, if the simulation is designed to measure both data gathering and communication skills, one would expect the scores for these two domains to be somewhat related (e.g., better communicators should be able to gather more information). Likewise, if external evaluations are available (e.g., knowledge examination, course grades) one might postulate both strong and weak relationships between the simulation assessment scores and these criterion measures. Often, the criterion measure is some measure of clinical experience. Here, one would normally expect individuals with greater expertise (e.g., more advanced training) to perform better on the simulation tasks.53,63,84–86 If this is not the case, then one could question whether valid inferences can be made based on the assessment scores. Overall, to the extent that postulated internal and external associations substantiate the hypothesized relationships, additional support for the validity of test score interpretations can be gathered.
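Two of the construct validation checks described above, an internal association between domain scores and a comparison of groups with different levels of training, can be sketched as follows. All scores and group labels in this Python example are hypothetical; the point is only to show the form such evidence takes, not to suggest what magnitudes should be expected.

```python
from math import sqrt
from statistics import mean, stdev


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def standardized_difference(group_a, group_b):
    """Mean difference (a minus b) in pooled standard deviation units."""
    na, nb = len(group_a), len(group_b)
    pooled_var = (((na - 1) * stdev(group_a) ** 2
                   + (nb - 1) * stdev(group_b) ** 2) / (na + nb - 2))
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical domain scores for the same six examinees.
data_gathering = [62.0, 70.0, 55.0, 81.0, 74.0, 66.0]
communication = [6.1, 6.8, 5.9, 7.5, 6.6, 6.9]

# Hypothetical total scores for two training levels.
senior_residents = [78.0, 82.0, 75.0, 88.0]
junior_residents = [65.0, 72.0, 70.0, 61.0]

r = pearson_r(data_gathering, communication)
d = standardized_difference(senior_residents, junior_residents)
print(f"data gathering vs. communication: r = {r:.2f}")
print(f"senior minus junior: d = {d:.2f}")
```

A positive correlation and a positive standardized difference would be consistent with the hypothesized relationships; the absence of either would prompt a closer look at the scenarios or the scoring rubric.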

For simulation-based medical assessments, at least those with high-stakes outcomes such as licensure or certification, the purpose of the evaluations is, most often, to assure the public that the individual who passes the examination is fit to practice, either independently or under supervision. The score-based decisions must be validated and demonstrated to be reliable using a variety of standard techniques, some of which have been applied for acute care mannequin-based assessments.87 It should be noted, however, that while defensible cut scores can be established for performance-based assessments, procuring evidence to support the validity of the associated decisions can be complicated.88 Like a driver’s examination, a properly constructed performance-based simulation assessment demonstrates what a person can and cannot do at one point in time. Although this may be indicative of future performance or competence,89 the predictive relationships may be weak and difficult to measure.90 Just as passing a state driver’s examination will not totally eliminate the possibility of reckless driving, driving while intoxicated, or speeding, simulation-based assessment of acute care skills will not do away with medical errors. It should, however, assure the public that practitioners are qualified91,92 and, based on the consequential impact of other previously implemented assessments, ultimately lead to greater patient safety.93–96


Summative assessments are used throughout the lengthy course of training physicians, from medical school admission to board certification and even recertification. As part of this process, simulation has taken a central role and continues to expand the assessment domain. The introduction of simulation-based assessments, especially those used as part of physician certification and licensure, has demanded fairly rigorous studies of exam administration protocols, scoring and equating models, and standard setting. Most important, major efforts have been directed at identifying and addressing potential threats to validity of simulation-based assessment scores. As a result, much of the testing groundwork has been completed. Thus, organizations wishing to incorporate simulation-based summative assessments into their evaluation practices have access to information regarding effective test development practices, selection and construction of appropriate metrics, minimization of measurement errors, and test score validation processes.

With the expansion of simulation models, including those employing mannequins and part-task trainers, and a continued emphasis in medicine on measuring specific competencies, the use of summative assessments is likely to expand. As evidenced through the literature on SP assessments and computer-based simulations, this expansion will necessitate operations-based research. Although many psychometric issues have been resolved, or at least addressed, efforts to assess skills more broadly, combined with the incorporation of new technologies and types of simulation scenarios, will certainly demand new metrics and the associated studies to support their adequacy. Fortunately, much of this work is now taking place and the results are quite promising. As a result, high-stakes simulation-based evaluations are likely to expand into new areas, including specialty board assessments and various CME activities.