Psychometric characteristics of simulation-based assessment in anaesthesia and accuracy of self-assessed scores*


* Presented in part at the Society in Europe for Simulation Applied to Medicine (SESAM) Annual Meeting, Stockholm, June 2004, and as an invited talk at the Simtect Health and Medical Simulation Symposium, Canberra, Australia, May 2004.

J. M. Weller


The purpose of this study was to define the psychometric properties of a simulation-based assessment of anaesthetists. Twenty-one anaesthetic trainees took part in three highly standardised simulations of anaesthetic emergencies. Scenarios were videotaped and rated independently by four judges. Trainees also assessed their own performance in the simulations. Results were analysed using generalisability theory to determine the influence of subject, case and judge on the variance in judges' scores, and to determine the number of cases and judges required to produce a reliable result. Self-assessed scores were compared to the mean score of the judges. The results suggest that 12–15 cases are required to rank trainees reliably on their ability to manage simulated crises. Greater reliability is gained by increasing the number of cases than by increasing the number of judges. There was a modest but significant correlation between self-assessed scores and external assessors' scores (rho = 0.321; p = 0.01). Trainees at the lower levels of performance consistently overrated their performance compared with those performing at higher levels (p = 0.0001).

Assessment of clinicians using simulators has attracted attention as it appears more closely aligned to performance in the real world than traditional methods of assessment. Simulators can be used to assess aspects of anaesthetic practice that would be difficult to assess by direct observation, such as the ability to manage a crisis.

In terms of reliability, several studies have examined the level of agreement between judges on an anaesthetist's performance in the simulator, and have shown that it is possible to generate reliable scores for a single performance with two to three judges [1–3]. In terms of validity, face validity is supported by questionnaire data on realism. More experienced anaesthetists score more highly than less experienced anaesthetists [4–6], supporting construct validity. Itemised checklists and rating scales have been used to score performance, although there is convincing evidence that rating scales of overall, or global, performance are more valid and reliable than checklists for rating complex tasks [7].

We know from other domains that performance varies between cases, and large numbers of cases and long test times are required to generate reproducible results for an individual candidate [8]. In an Objective Structured Clinical Examination (OSCE) or the long case in a medical viva, extensive testing time is required for a reliable result. A recent study of acute care skills in medical students and junior doctors found that simulation was a valid and reliable assessment tool, but multiple encounters were required to ensure a reliable result [9].

A candidate's score in an examination will have a number of sources of variance: the candidate him- or herself; the particular case; the judge; and the interaction between all these components. Where the purpose of the assessment is to rank the candidates in order of ability, ideally the candidate should be the largest source of variance. No previous studies have investigated how many simulated cases a candidate should undertake before it could be confidently said that the final score truly reflected his or her ability. Nor do we know the optimum number and arrangement of cases and judges to produce a reliable assessment in the simulator. An additional unstudied area is the ability of anaesthetists to assess their own performance in the simulator. There is evidence that, in general, self-assessed scores are inaccurate and correlate poorly with externally assessed scores, suggesting that doctors are not always able to identify their own deficiencies. Identifying deficiencies in knowledge and practice is a fundamental motivator to continuing professional development. Evidence suggests that improved accuracy of self-assessment leads to improved learning outcomes. This may be relevant to the emerging field of simulation-based learning.

The aim of our study was to determine the following: the effect of candidate, case and judge, and the interaction between each of these components, on the variance in scores; the number of cases and judges required for a reproducible assessment; the reliability of different test formats; and the accuracy of self-assessment.


Ethical approval was obtained from the Wellington Ethics Committee. Written informed consent was obtained from all participants and all agreed not to disclose the content of the case scenarios. They were assured that their performance would remain confidential within the study group. There were 22 trainees in the region at the time, with 1–5 years' experience in anaesthesia. All had some experience with simulation. All were invited to participate.

We used the METI (Medical Education Technologies Incorporated, Sarasota, FL) human simulator placed in a simulated operating room, with an anaesthetic machine and monitoring familiar to the trainees. Three tightly scripted and highly standardised simulator cases were developed, each lasting 15 min, involving an emergency developing during the course of an operation. The emergencies were anaphylaxis following induction of anaesthesia, oxygen pipeline failure in a critically ill patient, and cardiac ischaemia followed by cardiac arrest. The scenarios were developed from those used in a previous study which found them to be of similar difficulty [2].

Trainees underwent a structured 30-min familiarisation with the simulator and a scripted introduction to the case. Faculty staff played the roles of the operating room team and were scripted to give key information at set times, and to provide help appropriate to their role and only in response to a request from the trainee. Major events occurred in a set timeline, but the course of the scenario varied in response to the interventions of the trainee. Prompts from faculty were not permitted. External help could be requested but would not arrive during the course of the scenario.

The order of the scenarios was randomised to counteract any sequencing effect. There was a break of 10–15 min between each scenario. All discussion was reserved until the study was completed. Following the final scenario, trainees completed a questionnaire seeking their ratings on a 5-point rating scale for realism of the cases, the extent to which their performance in the simulator was a reasonable measure of how they would manage the case in real life, and the value of the scenarios as a learning experience. Trainees were asked to rate their performance in each of the three cases, using a 5-point rating scale for overall performance with a simple descriptor for each level of performance (1 = unsatisfactory, 2 = borderline, 3 = satisfactory, 4 = good, 5 = excellent). At the completion of the study, trainees discussed their performance with an experienced simulation centre instructor who provided feedback.

The scoring system was developed by four experienced anaesthetists in a previous study, where a draft, 5-point rating scale was used over 10 videotaped scenarios and refined through a process of rating and consultation. The form subsequently demonstrated acceptable intermarker agreement when used to assess the performance of anaesthetists in videotaped scenarios [2]. In the present study, we developed a list of written criteria for expected medical management for each case, which was agreed on by the study judges. The 5-point rating scale had descriptors of each level of performance, the minimum requirements for an acceptable performance, actions required for an outstanding performance, and actions that would result in a fail. Four judges gave a score for overall performance in each of the scenarios based on their expert opinion, guided by the agreed criteria and standards.

To minimise prior knowledge of the New Zealand trainees, we chose judges from Australia. Australia and New Zealand have a joint College of Anaesthetists and identical training requirements, ensuring that the judges had similar expectations of trainees. All judges were specialist anaesthetists with experience in simulation. They received a written explanation of the study and the rating process, protocols for rating, criteria for assessment, and scoring sheets. Two training meetings and four teleconferences were held as part of examiner training, but at no point were judges aware of the scores of others before rating any tape. The tapes of the scenarios were anonymised, randomised and rated in the same order at independent sites. All four judges rated the full 15 min of all cases.


Results were analysed using generalisability theory with the Genova (GENeralized analysis Of VAriance) statistical package. This uses all the data to quantify each source of error (subject, case, judge, and the interactions between these components) and its relative contribution to the variance in scores [10]. The separate sources of variation are combined to express the extent to which differences between trainees' scores reflect reproducible differences between trainees, yielding a coefficient between 0 and 1 called the generalisability coefficient (G). A value of 0.8 for G is the accepted level of reliability required for a high-stakes assessment.

As the study was a fully crossed design for candidate, case and judge, all sources of error could be quantified. This allowed generation of the ‘D’ study [10], analogous to a power calculation in an intervention trial, mathematically modelling G in different hypothetical assessment formats. We could thus determine the most efficient use of cases and judges to produce a reliable assessment, if an acceptable level for G was set at 0.8.
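The core of this analysis can be illustrated with a short sketch (our own, not the Genova package's code, and run here on hypothetical data): for a fully crossed trainee × case × judge design with one score per cell, the variance components follow from the ANOVA mean squares via the expected-mean-square equations.

```python
import numpy as np

def variance_components(X):
    """Estimate random-effects variance components for a fully crossed
    trainee x case x judge design with one score per cell (X[p, c, j]),
    by solving the expected-mean-square equations. Negative estimates
    are clipped to zero, as generalisability software conventionally does."""
    n_p, n_c, n_j = X.shape
    m = X.mean()
    mp, mc, mj = X.mean(axis=(1, 2)), X.mean(axis=(0, 2)), X.mean(axis=(0, 1))
    mpc, mpj, mcj = X.mean(axis=2), X.mean(axis=1), X.mean(axis=0)

    # Mean squares for main effects and two-way interactions
    ms_p = n_c * n_j * ((mp - m) ** 2).sum() / (n_p - 1)
    ms_c = n_p * n_j * ((mc - m) ** 2).sum() / (n_c - 1)
    ms_j = n_p * n_c * ((mj - m) ** 2).sum() / (n_j - 1)
    ms_pc = n_j * ((mpc - mp[:, None] - mc[None, :] + m) ** 2).sum() / ((n_p - 1) * (n_c - 1))
    ms_pj = n_c * ((mpj - mp[:, None] - mj[None, :] + m) ** 2).sum() / ((n_p - 1) * (n_j - 1))
    ms_cj = n_p * ((mcj - mc[:, None] - mj[None, :] + m) ** 2).sum() / ((n_c - 1) * (n_j - 1))
    # Three-way interaction, confounded with residual error (one score per cell)
    resid = (X - mpc[:, :, None] - mpj[:, None, :] - mcj[None, :, :]
             + mp[:, None, None] + mc[None, :, None] + mj[None, None, :] - m)
    ms_pcj = (resid ** 2).sum() / ((n_p - 1) * (n_c - 1) * (n_j - 1))

    v = {
        'pcj': ms_pcj,
        'pc': (ms_pc - ms_pcj) / n_j,
        'pj': (ms_pj - ms_pcj) / n_c,
        'cj': (ms_cj - ms_pcj) / n_p,
        'p': (ms_p - ms_pc - ms_pj + ms_pcj) / (n_c * n_j),
        'c': (ms_c - ms_pc - ms_cj + ms_pcj) / (n_p * n_j),
        'j': (ms_j - ms_pj - ms_cj + ms_pcj) / (n_p * n_c),
    }
    return {k: max(val, 0.0) for k, val in v.items()}

# Sanity check: if scores depend only on the trainee, all variance is 'p'
a = np.array([1.0, 2.0, 3.0, 4.0])
X = np.broadcast_to(a[:, None, None], (4, 3, 2)).copy()
print(variance_components(X)['p'])   # equals np.var(a, ddof=1)
```

The per-component bookkeeping shows why a fully crossed design is so informative: every interaction term has its own mean square, so every source of error can be separated.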

An additional analysis was undertaken to determine the reliability of the judges' overall score for a single scenario, in order to estimate the reliability of individual self-assessed scores. Agreement between judges' scores was assessed using a one-way random-effects ANOVA model. The Spearman rank correlation test was used to compare self-assessed and external scores.


Twenty-one of the 22 eligible trainees participated in the study (one was unable to attend because of service commitments). All 21 trainees completed all three scenarios. The tape of one scenario was lost due to recording difficulties, leaving 62 tapes, all of which were rated by the four judges. All trainees completed the questionnaire and self-assessment.

The variance components for trainees, judges and cases, and the interactions between them, are shown in Table 1. The trainee variance component is an estimate of the variance between trainees on mean scores. As the purpose of the assessment is to detect differences between trainees, most variance should occur here. The case component variance here is zero, indicating very low variation in the overall mean scores for cases, and therefore that the cases were of equal difficulty. The largest variance component was the interaction between trainee and case, indicating that the rank orderings of different trainees varied with each case. In other words, an individual trainee's performance varied across the three cases.

Table 1. Variance components for candidate, case and judge, and the interactions between combinations of these effects, indicating how these effects and interactions contribute to the final score awarded to the candidate. Values are mean (standard error).

Effect under study        Variance component
Trainee                   0.2758041 (0.1537267)
Case                      0.0       (0.0149178)
Judge                     0.0660819 (0.0660819)
Trainee × case            0.3712719 (0.1274148)
Trainee × judge           0.0575292 (0.0655307)
Case × judge              0.0002924 (0.0195556)
Trainee × case × judge    0.7510965 (0.0986238)

The variance component due to the judge was relatively low, indicating that the judges were of similar average stringency. The variance due to the interaction between judge and trainee was also relatively low, suggesting that the different judges ranked trainees similarly. The final variance component is the total variance in the scores due to the combined effect of judge, case and trainee.

The generalisability coefficient for the test format in the study (three cases, four judges) was 0.58, indicating that a test format with three cases and four judges does not produce a reliable score for examinees. With this fully crossed design (where all trainees undergo the same cases and all cases are marked by the same set of judges) we could estimate that G would approach the accepted level of 0.8 only where 10–12 cases were included with three to four judges each rating all the cases. Increasing the number of cases increased confidence in the assessment to a much greater extent than increasing the number of judges, reflecting the relatively larger contribution made by variation in trainees' performance between cases (Table 2).

Table 2. Generalisability coefficients (G) for a selection of different numbers of cases and judges. In the lower right-hand corner, G approaches an acceptable level of reliability (0.8), indicating an acceptable test format.

No. cases    1 judge    2 judges    3 judges    4 judges    5 judges
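For relative (rank-order) decisions in a fully crossed design, G depends only on the trainee component and the error components involving the trainee, each divided by the relevant number of observations. Using the variance components reported in Table 1, a short sketch (ours, using the standard G-theory formula) reproduces the reported coefficient of 0.58 for the study format:

```python
# D study for the fully crossed design: G as a function of the number of
# cases (nc) and judges (nj), using the variance components from Table 1.
def g_crossed(nc, nj, vp=0.2758041, vpc=0.3712719, vpj=0.0575292, vpcj=0.7510965):
    """Generalisability coefficient for a fully crossed trainee x case x judge
    design. Only components involving the trainee contribute to the error
    term for relative (rank-order) decisions."""
    error = vpc / nc + vpj / nj + vpcj / (nc * nj)
    return vp / (vp + error)

print(round(g_crossed(3, 4), 2))   # the study format (3 cases, 4 judges): 0.58
for nc in (3, 6, 9, 12, 15):       # adding cases improves G faster than adding judges
    print(nc, round(g_crossed(nc, 4), 2))
```

Because the trainee × case component (0.37) is divided only by the number of cases, while the judge-linked components are already small, extra cases shrink the dominant error term and extra judges do not.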

Using a nested design (the judge is ‘nested’ in, or attached to, the case) for the D study where a different judge marks only a single case, a G of 0.78 was achieved with 12–15 cases. The implication here is that a test format where a different judge was assigned to mark a single case would be as reliable as four judges marking all cases. Fifteen judges would be required, but they would each only have to mark the trainees in a single case (Table 3).

Table 3. D study for the nested design: one judge is assigned to mark each case, and G (generalisability coefficient) indicates the reliability of the assessment with increasing numbers of cases.

No. cases    G
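The nested D study follows the same logic, sketched below with the Table 1 components (our own calculation, using the standard formula for a judge-nested-in-case design): with a different single judge per case, the judge-linked components can no longer be separated from the case-linked ones, so every error term shrinks only as the number of cases grows. The sketch reproduces the reported G of 0.78 at 15 cases.

```python
def g_nested(nc, vp=0.2758041, vpc=0.3712719, vpj=0.0575292, vpcj=0.7510965):
    """Generalisability coefficient when a different single judge marks each
    case (judge nested in case): the judge-related components are confounded
    with the case-related ones, and all error terms are divided by nc."""
    error = (vpc + vpj + vpcj) / nc
    return vp / (vp + error)

for nc in (6, 9, 12, 15):
    print(nc, round(g_nested(nc), 2))   # 15 cases gives G = 0.78
```

The practical appeal is logistical: each of the 15 judges marks every trainee in only one case, rather than four judges each marking every trainee in every case.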

All scenarios were rated 3 or higher for realism by 94% of trainees, and 64% of trainees rated the realism as 4 or 5 on a 5-point scale where 5 indicated most realistic (Fig. 1). Trainees on the whole felt that the scenarios were a good test of their ability to manage these events in real life (Fig. 2). Trainees rated the simulator assessment scenarios very highly as a learning experience: median (IQR [range]) 4 (4–5 [2–5]), where 5 indicates very valuable (21 responses).

Figure 1.

Trainees' ratings of realism of the O2 failure (white), anaphylaxis (black) and cardiac arrest (speckled) scenarios (1 = not realistic to 5 = very realistic).

Figure 2.

Trainees' ratings of how well the O2 failure (white), anaphylaxis (black) and cardiac arrest (speckled) scenarios assessed their ability to manage the event in real life (1 = not at all well to 5 = very well).

An estimate of 0.40 (95% CI 0.27–0.54) was obtained for the intraclass correlation for the individual ratings of the four judges. The estimated reliability of the mean of the four judges for each trainee's performance was considerably higher, at 0.73, an acceptable level of reliability. There was a modest but significant correlation between self-assessed scores and the mean judges' scores for the 62 scenarios (rho = 0.321; p = 0.01). The self-assessed and external scores were the same in 15 scenarios (24.2%), and fell within one point on the 5-point rating scale in 33 scenarios (53.2%). Twelve scenarios (19.4%) differed by 1.5–2 points on the scale, and two (3.2%) differed by 2.5–3 points.
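The reported reliability of the mean of the four judges follows directly from the single-rater intraclass correlation via the Spearman-Brown prophecy formula, as a quick check shows:

```python
def spearman_brown(r_single, k):
    """Spearman-Brown prophecy formula: reliability of the mean of k raters,
    given the reliability (intraclass correlation) of a single rater."""
    return k * r_single / (1 + (k - 1) * r_single)

print(round(spearman_brown(0.40, 4), 2))   # 0.73, as reported for four judges
```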

Participants whom the judges scored low overrated their performances, while those the judges scored higher underrated theirs. This relationship is demonstrated by an inverse correlation between the external score and the difference between the self-assessed and external scores (rho = −0.614; p < 0.0001). In 18 scenarios, the mean judges' score was ≤ 2.5 on the 5-point scale. In 15 (83%) of these low-scoring scenarios, the participants awarded themselves a higher score than did the judges. In the 44 higher-scoring scenarios, where the judges' score was > 2.5, only eight (18%) of the participants awarded themselves a higher score than the judges (p < 0.0001).
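The text does not name the test behind p < 0.0001. One way to check the comparison of over-rating rates from the reported counts (15 of 18 low-scoring versus 8 of 44 higher-scoring scenarios) is a one-sided Fisher's exact test, computed here from first principles with the hypergeometric distribution; this is a sketch of a plausible analysis, not necessarily the one the authors used:

```python
from math import comb

# Reported counts: of 18 low-scoring scenarios (judges' score <= 2.5),
# trainees self-rated higher in 15; of the 44 higher-scoring scenarios, in 8.
N, K, n, observed = 62, 23, 18, 15   # scenarios, total over-ratings, low group size, over-ratings in low group

# One-sided Fisher's exact p: probability of seeing 15 or more over-ratings
# in the low-scoring group if over-rating were unrelated to the judges' score.
p = sum(comb(K, k) * comb(N - K, n - k) for k in range(observed, n + 1)) / comb(N, n)
print(p)
```

The expected count under the null is only 18 × 23/62 ≈ 6.7 over-ratings in the low-scoring group, so the observed 15 is a very large departure and the p-value is accordingly tiny.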


We determined that 10–15 cases, or 3–4 h of testing, are required to rank trainees reliably in their ability to manage simulated anaesthetic emergencies. It is more feasible, and equally reliable, to allocate one judge to mark all trainees in a single case and use 15 cases, than to have four judges each mark every trainee in 10–12 cases. We found a modest but significant correlation between trainees' scores of their own performance and judges' scores. There was a significant difference in self-assessment between high- and low-scoring scenarios: in scenarios where performance was poor, trainees scored themselves significantly more highly than did the judges.

The results of our study are consistent with comparable studies in other areas. Wass et al.[11] compared final year medical students' performances in history taking across two randomly chosen unstandardised cases. The generalisability coefficient for a single long case with a single judge was only 0.3. They estimated, using generalisability theory, that 8–10 cases would be required to produce generalisability coefficients above the acceptable level of 0.8. Again using generalisability theory, Boulet et al.[9] found that six simulations were not enough to test acute care skills in medical undergraduates and recent graduates. A large number of cases is required because performance varies between cases, and performance in one case is not a good predictor of performance in another [8]. This applies to written patient management problems, computer-based simulations, standardised patients and real patients [12–14]. The specific knowledge and experience that students bring to a case seem to be more important than general problem solving skills.

Increasing agreement between judges improves test reliability, and is possible in simulations where very specific guidelines for diagnosis and treatment exist. Boulet et al.[9] found high levels of agreement between judges when assessing the acute care skills of medical students and recent graduates. The cases lasted 5 min, with specified criteria for evaluation, diagnosis and treatment. Morgan and Cleave-Hogg also generated high levels of agreement between judges when assessing medical students in simulated anaesthesia problems with clear treatment pathways [15]. In contrast, Gaba et al.[1] found that agreement between judges was lower when assessing anaesthetists in simulations where there was more than one possible treatment pathway and where performance varied over time. In a previous study [2], we found that when assessing experienced practitioners over 30-min simulations, reasonably reliable scores could only be achieved by using the mean score of three judges. Improving examiners' agreement is not easily achieved and requires training, regular practice and feedback [16], and this may not be the most efficient way of improving test reliability. Newble and Swanson [8] showed that, in the context of OSCEs, using twice as many cases and one rater per station increased reliability to a much greater extent than using two raters per case. This is consistent with the results of our study.

Although previous studies have assessed aspects of simulator performance in anaesthetists [1–6, 17, 18] and medical undergraduates [9, 15, 19], only one has included self-evaluation. MacDonald et al.[19] found that medical undergraduates' self-assessed scores of simple technical manipulations became more accurate when they had developed some expertise with the skill. However, inaccuracy of self-assessment in health professionals has been well documented [20, 21]. Although increasing knowledge and experience improves accuracy of peer assessment [22], the ability to assess one's own performance remains poor. Gordon [21] found that the accuracy of health professionals' ratings of their performance over a clinical attachment did not improve with subsequent years of training. The tendency of poor performers to overrate their performances has also been noted [20]. Gordon [21] found that in the domain of factual knowledge, there was ‘a vast overconfidence in those who know little’, and suggested that self assessment is tied to stable self-concepts of general ability, and appears refractory to objective evidence or judgement by qualified observers. The insight required to recognise one's own weaknesses and gaps in learning [23] is fundamental to continual professional development and lifelong learning. Although in the current study, trainees found the simulations a valuable learning experience, this learning would be further enhanced if they could more reliably identify their own deficiencies. A number of interventions to improve self-assessment have been described. Showing a video replay of the scenario, or providing written criteria of expected actions, has not always been effective at improving self-assessment. Explicit discussion and reconciliation of the differences between the self-assessed scores and other sources of evaluative data appear to be vital components of programmes to improve self-assessment [23]. 
Martin et al.[22] found that showing benchmarking videos of different standards of performance may improve self-assessment. Boud [20] suggested that negotiating the criteria for assessment with the students and defining the standard against which performance is judged are effective strategies. The implications for simulation training are clear. Educational benefit will be maximal if the criteria for good and bad performance are negotiated, the expected standards and actions are explicit, and the difference between participants' views of their performances and the view of external observers is reconciled.

This study has limitations in terms of scope and numbers. It is difficult to generate large numbers of simulator assessments as, unlike established assessment methods, there is no existing pool of data and obtaining data is time consuming and expensive. Including larger numbers of candidates and cases would generate increasingly reliable estimates of the generalisability coefficient of different test formats. Numbers of trainees in this study were too small to allow subgroup analysis of performance and self-assessment at different levels of training, or correlation with other markers of performance. Face validity of the simulations was supported by trainees' responses to the questionnaire, but other aspects of validity require further study. Self-assessment could potentially be more accurate if trainees were given the same criteria for performance and underwent standard setting exercises. This would be an interesting area for future studies [22], with potential to help trainees to evaluate their performances more accurately not only in the simulator but also in clinical practice.

In conclusion, we have shown a disparity between self-assessed and externally rated scores. We have also demonstrated that it is possible to rank the performance of anaesthesia trainees reliably using patient simulation. However, large numbers of cases are required.


We would like to thank the Wellington Department of Anaesthesia for making it possible for the trainees to participate, and the Wellington Anaesthesia Trust for providing funding for the study.