Assessing Patient Care: Summary of the Breakout Group on Assessment of Observable Learner Performance


  • The list of breakout session participants can be found as the appendix of a related article on page 1486.
  • The paper reports on a breakout track of the Academic Emergency Medicine consensus conference “Education Research In Emergency Medicine: Opportunities, Challenges, and Strategies for Success” held May 9, 2012, in Chicago, IL.
  • The authors have no relevant financial information or potential conflicts of interest to disclose.

Address for correspondence and reprints: James Kimo Takayesu, MD, MS; e-mail:


There is an established expectation that physicians in training demonstrate competence in all aspects of clinical care prior to entering professional practice. Multiple methods have been used to assess competence in patient care, including direct observation, simulation-based assessments, objective structured clinical examinations (OSCEs), global faculty evaluations, 360-degree evaluations, portfolios, self-reflection, clinical performance metrics, and procedure logs. A thorough assessment of competence in patient care requires a mixture of methods, taking into account each method's costs, benefits, and current level of evidence. At the 2012 Academic Emergency Medicine (AEM) consensus conference on educational research, one breakout group reviewed and discussed the evidence supporting various methods of assessing patient care and defined a research agenda for the continued development of specific assessment methods based on current best practices. In this article, the authors review each method's supporting reliability and validity evidence and make specific recommendations for future educational research.

In 2001, the Accreditation Council for Graduate Medical Education (ACGME) introduced a timeline for the implementation of training and assessment in six core competencies that form the foundation of clinical competence. Introduced in 1996, the Canadian CanMEDS manager competency correlates to the ACGME patient care competency, broadly defined as “the active engagement in decision-making in the operation of the healthcare system.”[1] The patient care competency for emergency medicine (EM) has been defined by a previous Academic Emergency Medicine (AEM) consensus conference,[2] now further elaborated on by the milestones in training,[3] as being able to efficiently gather and synthesize medical and diagnostic information, prioritize tasks, and implement management plans on multiple patients, as well as performing essential invasive procedures competently.

There is an explicit expectation that physicians in training demonstrate competence in various aspects of clinical care prior to graduation and professional practice.[4] While this accountability falls squarely on the shoulders of residency training programs, it is mirrored by commensurate expectations of maintenance of competency during ongoing professional practice.

The goals of the 2012 AEM consensus conference patient care working group were to describe the current state of evidence for assessment of competence in patient care and define a research agenda for the further development of specific assessment methods based on current best practices.


MEDLINE was searched from 1996 to the present using the keyword search terms “assessment,” “patient care,” “competency,” “competence,” “assess*,” “emergency,” and “education,” limited to humans and English language [Boolean search: ((assessment AND patient care AND (competency OR competence)) OR (assess* AND emergency AND education)), resulting in 3,493 hits; (patient care AND competency) AND assessment, resulting in 282 references]. These searches were combined with the additional search terms “resident* or medical student*” (58,880), resulting in 414 and 267 final results, respectively. After reviewing for relevance, 76 articles remained. Additional references were identified from review of these results and are included when relevant. These articles were used as a foundation for the breakout group's discussion.


Overview of Assessment Methods Identified

Multiple methods have been used to assess competence in patient care, including direct observation, simulation-based assessments, objective structured clinical examinations (OSCEs), global faculty evaluations, 360-degree evaluations, portfolios, self-reflection, clinical performance metrics, and procedure logs. Demonstration of competency in patient care at successive levels of training and across multiple clinical scenarios requires several overlapping methods to ensure validity of assessment. A survey of Canadian residency programs found an average of 1.75 assessment methods in use.[5] Of equal importance are the frequency and consistency of formative assessment,[6] its integration into the educational curriculum, and the “catalytic effect”[7] of assessment results and feedback on improving individual performance. The selection of assessment methods will be based at least in part on the availability of financial, faculty, and learning resources within a residency. Each method and its supporting evidence for validity and reliability will be discussed individually.

Direct Observation

Direct observation allows the learner to be observed in the clinical setting. It allows faculty to provide formative feedback to the learner in real time[8-11] and tends to generate more specific feedback and constructive comments compared to global assessments.[12, 13] At least 55 direct observation tools have been developed, but only a few have had their reliability, validity, or educational outcomes measured.[14]

Faculty training on the use of any direct observation tool is important given the potential for variability in the interpretation of both a clinical encounter and the tool’s language, yet few studies have demonstrated more than cursory observer training.[14] There is evidence, however, that even without extensive training, certain tools have good to excellent reliability.[10, 15] The correlation between direct observation and other measures of competency, such as written test scores,[16-25] OSCEs, or standardized patient assessments,[18-21, 25, 26] has been studied in a number of specialties, showing modest correlations that support the validity of certain direct observation methods. Internal medicine has produced many studies of direct observation; the best supported tool is the mini–clinical evaluation exercise (mini-CEX), which has robust evidence for its validity and reliability.[11, 15] Other specialties such as physical medicine and rehabilitation have developed similar tools for clinical assessment.[27] The EM Standardized Direct Observation Assessment Tool (SDOT) has been shown to have good inter-rater reliability when residents were observed via videotaped interactions[10] and, if liberal agreement criteria were used, in real-time clinical practice.[28] Learners report improved satisfaction[9] and perceive a positive effect on their clinical care with direct observation assessment.[14] Demonstration of a change in the delivery or quality of patient care is rare in direct observation studies; more often, the reported improvements are learner or observer self-assessed modifications of attitudes, knowledge, or skills.[14]

Although faculty generally like direct observation as an assessment method,[8, 9, 27, 29] adding this responsibility to existing faculty requirements of direct patient care, supervision, and bedside teaching may seem burdensome. A few EM residency programs have used nonclinical faculty to perform direct observation[8, 29] with reported success; however, this may not be financially practical for many programs. Another concern is the fact that certain patient care encounters such as resuscitations require faculty supervision and direct participation, limiting the ability to perform direct observation. One solution can be videotaping resuscitations for delayed review and debriefing,[30] although technical and legal (HIPAA) barriers exist.


Simulation has the advantage of using standardized scenarios that can be designed to assess specific skills and global patient care without risk to patients. When paired with directed feedback, simulation assessments have demonstrated long-term retention of certain skills at 1.5 years.[31] Scenarios and their assessment rubrics must be both designed in a standardized format that permits dissemination and tested for their reliability and evidence of validity.[32, 33] When high-fidelity simulation (HFS) and the core competencies were first introduced, assessment tools were unvalidated and considered too blunt to provide more than formative assessment[34]; as assessment design becomes increasingly reliable and valid, using simulation-based assessment (SBA) as a summative, or high-stakes, measurement of competency is an important area for further research.[35]

Learners can be assessed with both checklists (e.g., time to action, critical actions performed) and global performance ratings, with different information gleaned from each, all potentially having good discriminatory power[30, 35, 36] and a combination being most useful.[37] Since patient care requires a broad skill set and knowledge base, multiple scenarios are needed to provide a valid assessment of overall patient care competency and to distinguish between performance at different levels of training. Murray et al.[38] demonstrated that 12 scenarios were needed in a study of residents and attending physicians in anesthesia, and that six were needed in another study comparing student certified registered nurse anesthetists to senior/junior anesthesiologists.[36]
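One way to see why several scenarios are needed is the Spearman-Brown prophecy formula from classical test theory, which projects the reliability of a score averaged over k comparable scenarios from the reliability of a single scenario. The sketch below is illustrative only: the cited studies may have used other methods (for example, generalizability analyses), the function names are hypothetical, and the single-scenario reliability of 0.30 is an assumed value, not a figure from the literature.

```python
import math


def projected_reliability(single_reliability, k):
    """Spearman-Brown: reliability of a score averaged over k parallel scenarios."""
    return k * single_reliability / (1 + (k - 1) * single_reliability)


def scenarios_needed(single_reliability, target_reliability):
    """Smallest number of parallel scenarios whose averaged score reaches the target."""
    if not (0 < single_reliability < 1 and 0 < target_reliability < 1):
        raise ValueError("reliabilities must lie strictly between 0 and 1")
    k = (target_reliability * (1 - single_reliability)) / (
        single_reliability * (1 - target_reliability)
    )
    # Small tolerance so floating-point noise does not round an exact integer up.
    return max(1, math.ceil(k - 1e-9))


# Assumed inputs, chosen only for illustration.
n = scenarios_needed(single_reliability=0.30, target_reliability=0.80)
```

Under these assumed inputs, a single scenario with modest reliability must be supplemented by several more before the averaged score becomes dependable, which is consistent in spirit with the multi-scenario findings described above.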

A HFS assessment rating tool should demonstrate both interobserver reliability and evidence of validity by demonstrating improved performance at higher levels of training. Assessments have demonstrated validity for both medical students[39, 40] and residents.[39-43] A set of four pediatric advanced life support scenarios demonstrated good inter-rater reliability and higher scores for more senior pediatric residents, but suggested that multiple scenarios are needed to provide a valid assessment.[44] Improved clinical performance has been demonstrated in advanced cardiac life support (ACLS) using a checklist SBA with high reliability and internal consistency in an internal medicine residency program.[45] One study assessing interns from multiple specialties managing two cardiac scenarios showed a surprising decrease in scores after the clinical experience of intern year, raising questions regarding the assessment's validity.[46] This highlights the importance of assessment rubrics reflecting the clinical skills and cognition that map to real-world competent patient care rather than rubrics directed at stratifying learner performance with items that may penalize more experienced learners who may skip steps.[39] The fact that experts may often use shortcuts to arrive at diagnostic conclusions[47] requires careful design of rubrics that do not overlook more advanced levels of performance.

The evidence for validity of HFS assessments when compared to other forms of assessment is limited. Gordon et al.[42] demonstrated validity of HFS assessment when compared to OSCE. One EM residency program designed a well-received simulation curriculum that found most learners to be competent, but did not translate to an increase in written test scores,[48] which highlights the need to design HFS and other methods of assessment around the educational outcomes the assessment is intended to measure.[37]

The ultimate evidence of validity is comparison to actual patient outcomes or subsequent improvements in patient care, but this has been infrequently measured.[49] Internal medicine residents receiving simulation ACLS training performed better than more senior residents with traditional training based on chart reviews of their resuscitations,[50] albeit limited by the fact that assessment of the intervention group was also closer in time to their ACLS training. Internal medicine residents demonstrated improved airway management skills in both the simulation laboratory and at the patient's bedside when scored by checklist after HFS training.[51] This training was achievable whether senior residents or faculty were training PGY-1 residents.[52] Pediatric EM and gastroenterology attending physicians performed better on a procedural sedation checklist after HFS training and assessment,[53] demonstrating that this effect is not limited to novices.

Simulation-based team training (SBTT) research is limited but shows promise in enhancing the more complex skills of team management and crisis resource management, as well as improving outcomes in simulated scenarios.[54] When added to traditional didactic teaching, simulation training has been shown to improve teamwork among members of emergency department (ED) staff.[55] Scoring systems such as the Ottawa Global Rating Scale demonstrate reliability and validity for assessing leadership, communication, and resource management.[56, 57]

High-fidelity simulation is resource-intensive, historically requiring faculty observer presence to assess individual learners during sessions. This workload has limited the widespread use of simulation-based assessment.[35] Video assessments would allow multiple assessments of one learner's performance without requiring all faculty members to be present during the simulation session. Williams et al.[58] have demonstrated that assessment of videotaped sessions has inter-rater reliability comparable to that of real-time assessment.


Objective structured clinical examinations are routinely used to evaluate multiple ACGME core competencies and are particularly useful for those that involve direct patient contact (data gathering, assimilation of data, and patient management). Published data indicate that EM educators have used OSCEs to assess multiple patient care competencies using a variety of clinical scenarios.[59] The American Board of Emergency Medicine (ABEM) oral examination format has been adapted to include assessments of core competencies into the critical actions of oral examinations based on changes to the Model of the Clinical Practice of Emergency Medicine.[60]

OSCEs have also been used to assess specific patient care tasks within EM, such as death disclosure[61] and intimate partner violence counseling.[62] While OSCEs have limited use in procedural training, standardized patients have been used for noninvasive nonpainful procedural training and assessment such as ultrasound. OSCEs have been used to evaluate ultrasonography of the abdominal aorta,[63] as well as the completion of the Focused Assessment with Sonography in Trauma examination.[59] In many of these circumstances the OSCE is used to evaluate the effectiveness of an educational intervention, either through comparison of pre- and posttesting or through comparison of study and control groups.

The reliability of OSCE assessment has been demonstrated through interobserver agreement[64, 65] and internal consistency.[66] Quest et al.[61] demonstrated good correlation of faculty and standardized patient ratings of resident performance; however, there was poor correlation between resident self-assessment and both faculty and standardized patient ratings, raising the question of the reliability of self-assessment using an OSCE format. The oral examination format used by ABEM has demonstrated an interexaminer agreement of 97% on critical actions and 95% on performance ratings.[65]
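Interobserver agreement figures such as those above are typically simple proportions of matching ratings. As a minimal sketch, the code below computes percent agreement for two raters and, as an added illustration not drawn from the cited studies, Cohen's kappa, which corrects that proportion for agreement expected by chance. The function names and the ratings (1 = critical action performed, 0 = not performed) are hypothetical.

```python
import math
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Proportion of items on which two raters gave the same rating."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters (Cohen's kappa)."""
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of each rater's marginal proportions per category.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1:
        return math.nan  # both raters used a single identical category; kappa undefined
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of ten critical actions by two examiners.
examiner_1 = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
examiner_2 = [1, 1, 1, 0, 1, 1, 1, 1, 1, 1]
agreement = percent_agreement(examiner_1, examiner_2)
kappa = cohens_kappa(examiner_1, examiner_2)
```

In this made-up example the examiners agree on 9 of 10 items, yet kappa is noticeably lower because most actions are rated as performed by both examiners, so much of the agreement could arise by chance.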

Validity evidence has been demonstrated through comparison to other measures, such as the mini-CEX,[26] improvement with increasing levels of training,[66-71] global evaluations,[72, 73] in-training examination scores,[74] and core competency-based evaluations of patient care, medical knowledge, and practice-based learning.[73] Wallenstein et al.[73] demonstrated that scores on an acute-care OSCE for PGY-1 residents correlated with global ratings of patient care and overall clinical performance at 18 months of training.

Global Assessment

Global assessments have been the most commonly used method to meet the ACGME requirement of biannual resident performance review,[2, 75] anchored by specific terminology derived from the core competencies[76] and most recently the EM milestones.[3] Global assessments are subject to recall bias, response bias, and the subjectivity of non-clinical factors such as the halo or millstone effects.[2] Faculty vary in their performance assessments, even when observing the same clinical encounter.[77] When anchored to specific criteria such as the core competencies, global assessments demonstrate reasonable reliability and evidence of validity.[24, 78] They have shown correlation with other measures of competence such as surgical in-training examination scores.[25] Thus, inclusion of specific assessment items that delineate the desired behaviors, skills, and actions is essential to reducing subjectivity[22, 78] and increasing internal consistency.

The reporter-interpreter-manager-educator (RIME) framework used in internal medicine clerkships is an assessment tool that has demonstrated excellent reliability and validity when compared to other measures such as U.S. Medical Licensing Examination (USMLE) scores and medical school grade point average.[79] Ander et al.[80] have demonstrated the validity of the RIME assessment tool for medical students when compared to standard multi-item global evaluations. One anesthesia residency program has developed a global assessment system that is completed on a biweekly basis throughout training. Over a period of 2 years, 14,000 evaluations were collected yielding data that could be normalized across individual faculty raters resulting in a “z-score” that demonstrated a very high degree of reliability and validity in predicting resident performance and the need for remediation.[78]
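The idea of normalizing scores across individual faculty raters can be illustrated with a short sketch. It assumes each evaluation is a (rater, resident, score) record and standardizes every score against that rater's own mean and standard deviation before averaging per resident. The function name, the record format, the data, and the choice of the population standard deviation are illustrative assumptions, not the method reported in the cited study.

```python
from collections import defaultdict
from statistics import mean, pstdev


def rater_normalized_scores(evaluations):
    """Return each resident's mean z-score after standardizing within each rater.

    evaluations: list of (rater, resident, score) tuples (hypothetical format).
    """
    scores_by_rater = defaultdict(list)
    for rater, _resident, score in evaluations:
        scores_by_rater[rater].append(score)

    # Each rater's own mean and spread capture that rater's leniency or severity.
    rater_stats = {r: (mean(s), pstdev(s)) for r, s in scores_by_rater.items()}

    z_by_resident = defaultdict(list)
    for rater, resident, score in evaluations:
        mu, sigma = rater_stats[rater]
        if sigma == 0:
            continue  # rater gave identical scores, so they carry no ranking information
        z_by_resident[resident].append((score - mu) / sigma)

    return {resident: mean(zs) for resident, zs in z_by_resident.items()}


# Made-up data: a lenient and a strict rater score the same two residents.
evaluations = [
    ("lenient", "A", 9), ("lenient", "B", 7),
    ("strict", "A", 5), ("strict", "B", 3),
]
scores = rater_normalized_scores(evaluations)
```

Although the two raters use very different parts of the scale, both place resident A one standard deviation above their own average and resident B one below, so the normalized scores agree where the raw scores do not.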

360-degree Evaluations

Although the 360-degree evaluation can involve anyone the learner comes in contact with during his or her professional duties,[81] it has most commonly been studied with nursing assessments[82, 83] and patient assessments.[84, 85] Resident professionalism and interactions with nurses improved in an EM residency after instituting nursing evaluation of the residents.[82] A study of practicing internists found nursing evaluations to be a useful measure of nonclinical skills.[83] When measuring clinical skills, the same group found that peer ratings required at least 11 items to be accurate.[86] Individual practice improvement after receiving 360-degree evaluation feedback varies due to both environmental factors, such as clinical workload and the hospital management culture, and individual factors, such as self-efficacy and motivation.[87] This suggests that awareness of 360-degree data may not be enough to influence behavioral change and improve outcomes in the patient care competency. Although patients value the clinical skills of residents involved in their care,[84] they may view clinical skills less favorably when not satisfied with resident care[88] regardless of the actual quality of care provided. Given the limited definition of patient care as previously defined by King et al.,[2] patient assessments would appear more applicable for the assessment of other core competencies.[84, 85, 88, 89]


To date there are no published studies on the reliability and validity of resident portfolios in EM to assess patient care competency. While resident satisfaction with the use of a learning portfolio in a general surgery training program was high, there was poor interobserver agreement on the assessment of the portfolio entries’ quality.[90] While the authors do not describe the submitted portfolio entries in detail, the template focuses more generally on differential diagnosis, diagnostic studies, and management options, rather than detail of operative procedures. Chart review can yield potentially valuable data on patient care, but may suffer from the confounding effects of collaboration with faculty as the chart is created. O'Sullivan et al.[91] present a model of chart review including appropriateness of history and physical documentation, orders, and additional supporting materials such as assessments by supervising physicians regarding the case presentation and resident efficiency in the ED. A subsequent study by the same primary author in psychiatry demonstrated the reliability of portfolio reviews when assessed using two to three reviewers. Validity was shown with respect to medical knowledge and level of training, but surprisingly not clinical performance.[92]

A Best Evidence Medical Education (BEME) systematic review on the educational effects of portfolios on undergraduate student learning was conducted in 2009.[93] Of the 69 studies analyzed, only about a quarter met the minimum selected quality indicators, and only 13% reported changes in student skills and attitudes. Although the review noted a trend of improving study quality in more recent analyses, the strength and extent of the evidence for the educational effects of portfolios are limited mostly to learner participation, rather than a measurable educational effect. These effects center around self-reflection, self-awareness, and medical knowledge,[93] rather than the patient care competency as previously defined.[2]

Reflection and Self-assessment

While self-assessment shows limited reliability and evidence of validity for professionalism and communication skills,[94] there is a lack of evidence to support its use in the high-stakes realm of physician competence in patient care. A systematic review in 2006 identified 17 studies comparing self-assessment to one or more external objective measures, such as OSCEs, simulation, examination performance, and supervisor evaluation (three studies used two external measures for a total of 20 comparisons).[95] Of the 20 comparisons, 13 demonstrated little, no, or an inverse relationship between self-assessment and objective external assessment. Among the remaining seven demonstrating an overall positive association, wide variability or methodologic errors were identified.[95]

More recent analyses have also failed to demonstrate a strong correlation between self-assessment and independent assessors. A general surgery training program compared resident self-assessment to external evaluation by peers, nurses, and attending physicians. In all comparisons, residents overestimated their global performance regardless of their specific performance level.[96] Residents underestimated their performance in specific competencies including patient care. Residents in the upper quartile of performance underestimated their performance in additional specific competencies, whereas residents in the lowest performance quartile overestimated professionalism skills. A similar study in anesthesia residents demonstrated moderate correlation between self- and observer assessments when reviewing their performance on three emergency HFS scenarios; however, this correlation was poorer at the lower levels of performance,[97] further supporting the unreliability of self-assessment for patient care competence.


Clinical metrics derived from chart review or patient care information systems can be useful in assessing an individual's performance as measured by patients per hour, relative value units (RVUs), or other clinical care measures (e.g., patient acuity, resource utilization).[6, 98] When linked to systematic and ongoing feedback, assessment of clinical metrics can lead to long-term clinical practice change.[6] While there is evidence that certain measures such as RVUs/hour correlate with individual cognitive assessments of multitasking ability,[99] they potentially suffer from a lack of specificity given the resident's inherent inability to practice independently because of his or her supervised role. The measure is more a reflection of the combined performance of the resident and supervising faculty than the resident in isolation. Rather than assessing the quality of an individual patient care encounter, metrics are better suited to assess a resident's ability, on average across multiple encounters, to complete management plans and disposition patients expediently.

Procedure Performance Assessment

Invasive procedural skills are an essential component of resident training. There is ample evidence that there are significant gaps in medical student and resident procedural competence,[100-103] as well as variability in the correct and safe performance of procedures among residents when performing procedures on patients.[104] There is strong evidence supporting the need for audit and feedback after teaching procedural skills such as central venous catheter insertion to ensure a prolonged and profound behavioral change.[105] The fact that explicit assessment of technical skills occurs in as few as 15% of some procedure-oriented residencies[75] highlights the need for improved training and structured assessment prior to direct patient care.

While paper or electronic procedure logs may keep track of a resident's cumulative experience, they do not involve direct observation and feedback on specific psychomotor skills by faculty or other certified trainers. Procedural competence has been assessed using multiple methods, including direct observation during patient care,[30, 106] cadaveric models,[107, 108] animal models,[109] simulated environments,[110] simulated task trainers,[41, 48, 111-115] objective structured assessments,[74] and procedure logs.[116] A recent meta-analysis of simulation-based medical educational methods demonstrated a consistency of results favoring simulation over traditional clinical educational methods.[117] The validity evidence is very strong for simulation procedural training as demonstrated by the real-world clinical effect of reducing infections[118] and complications[119] related to central venous line placement after simulation training,[118, 119] supporting the use of simulation methods for procedural skill competency assessment. As with direct observation and HFS assessments, rubrics with demonstrated evidence of validity and inter-rater reliability are essential to ensuring the quality of these assessments.[106, 113, 120] Once validated, these rubrics can be used by nonclinical raters, decreasing the resource intensity of the assessment.[121]


Consensus Recommendations

A holistic assessment of competence in patient care requires a mixture of methods rather than any single method of assessment, taking into account each method's costs, benefits, and current level of evidence (see Table 1). Assessments should focus on specific behaviors, tasks, and skills, with opportunities for formative feedback and repeated performance,[47, 122] enabling formative feedback to drive learner growth.[122] The assessment rubric should undergo rigorous testing of its reliability and evidence of validity by comparing its results to actual patient care and patient outcomes. Follow-up assessment is important to ensure durability of competence, which can influence curricular changes in the timing, structure, or repetition of educational interventions throughout residency training (see Figure 1). A variety of assessment methods is necessary to accommodate local variations in access to high-cost technologies such as HFS.

Table 1. Summary of Methods

| Assessment Method | Strengths | Weaknesses | Relative Cost (Excluding Faculty Time) | Highest Level of Evidence of Outcomes (a) |
| --- | --- | --- | --- | --- |
| Simulation | No risk; wide range of scenarios/resuscitations | Suspension of disbelief | | |
| Direct observation | Actual patient care | Variable scenarios/resuscitations | $–$$ | 2–3 |
| 360-degree evaluations | Multiple sources for observations | Potential for participation bias, halo and millstone effects | $ | 1 |
| OSCE, oral examination, standardized patients | No risk; high fidelity | Smaller range of scenarios/resuscitations | $–$$$ | 2 |
| Portfolios | Learner-driven | Collection of reflections and work outputs rather than actual patient care | | 1 |
| Self-reflection | Reflective practice | Poor correlation at lower performance levels | | 1 |
| Global assessment | Moderate validity when anchored | Potential for participation bias, halo and millstone effects | $ | NA |
| Metrics | Reflect actual measure of clinical practice | Limited by dependence on supervising faculty | $–$$ | 1 |

OSCE = objective structured clinical examination.

(a) Outcomes were rated using a modified Kirkpatrick hierarchy wherein levels of impact are as follows: 1 = participation (learners’ or observers’ views on the tool or its implementation); 2 = learner or observer self-assessed modification of attitudes, knowledge, or skills; 3 = transfer of learning (objectively measured change in learner or observer knowledge or skills); and 4 = results (change in organizational delivery or quality of patient care).

Figure 1. Agenda for developing, validating, and implementing assessments. PGY = postgraduate year.

Direct observation, OSCE, and HFS have the strongest evidence as valid and reliable assessment methods. Global assessments and 360-degree evaluations require specific behavioral anchors to increase their validity and large response rates to control for confounders such as the halo/millstone effects and individual rater variability. Metrics can provide valuable performance data for residents in their more senior years, since these measures can be directly compared to attending physician performance standards. Portfolios and self-reflection lack evidence to support their use as stand-alone assessments of patient care, but have the benefit of encouraging the reflective and learner-directed practice that forms the basis of continuing medical education.

Research Agenda

  • Determine the number of direct observation assessments and types of patient encounters (e.g., critical diagnoses, chief complaints, diagnostic complexity) that are needed to provide a valid reflection of patient care competence for an individual resident.
  • Design and codify a process to create reliable and valid simulation, objective structured clinical, and oral examination assessments that use checklists (time to event or critical action) and global ratings to assess competence in ways that reflect expert clinical practice (which may use shortcuts) rather than simply the accomplishment of basic task lists.
  • Determine the number of global assessments needed to compose a valid assessment of a resident's patient care competence accounting for the known biases of this method.
  • Assess the validity and relevance of nonclinician evaluations in patient care competence given the influence of potential confounders.
  • Determine the validity of clinical metrics relative to other more-studied forms of assessment with good reliability and validity such as direct observation, OSCE, and simulation.
  • Develop standardized training programs and assessments for procedural skill acquisition (such as those for central line insertion), starting with no-risk methods such as simulated, cadaveric, or OSCE experiences and concluding with direct observation assessment during actual patient care and correlation to complications and patient outcomes.