Presented at the 2011 Society for Academic Emergency Medicine annual meeting, Boston, MA, June 2011.
Synchronous Collection of Multisource Feedback Evaluations Does Not Increase Inter-rater Reliability
Article first published online: 14 OCT 2011
© 2011 by the Society for Academic Emergency Medicine
Academic Emergency Medicine
Special Issue: CORD/CDEM Educational Advances Supplement
Volume 18, Issue Supplement s2, pages S65–S70, October 2011
How to Cite
Garra, G. and Thode, H. (2011), Synchronous Collection of Multisource Feedback Evaluations Does Not Increase Inter-rater Reliability. Academic Emergency Medicine, 18: S65–S70. doi: 10.1111/j.1553-2712.2011.01162.x
The authors have no relevant financial information or potential conflicts of interest to disclose.
Supervising Editor: Terry Kowalenko, MD.
- Issue published online: 14 OCT 2011
- Received April 11, 2011; revision received June 21, 2011; accepted June 28, 2011.
Objectives: Most multisource feedback (MSF) evaluations are performed asynchronously, with raters reflecting on the subject’s behavior. Numerous studies have demonstrated poor inter-rater reliability of MSF. This may be due to cognitive biases that are inherent in such a process. We sought to determine if within- and between-rater group reliability is increased when evaluations are gathered synchronously and relate to a specific patient interaction.
Methods: This was a survey of 30 emergency medicine (EM) residents at a university emergency department (ED). ED nurses and faculty anonymously participated in asynchronous MSF assessment of resident performance from February to April 2010 using a Web-based survey, the Emergency Medicine Humanism Scale (EM-HS). In May 2010, a second round of MSF collection was conducted in the ED. At the conclusion of patient encounters, the EM-HS was synchronously obtained from ED nurses and faculty. Evaluators were instructed to assess the resident based on the patient encounter, setting aside any preconceptions of resident performance, attitude, or behavior. Evaluators rated resident performance on a 1–9 scale (“needs improvement” to “outstanding”). The mean rating for each question and the total score provided by each evaluator class were calculated for each EM resident. Differences between the asynchronous and synchronous ratings were compared with t-tests. Pearson correlations were used to measure agreement in scores within and between nurse and faculty rater groups. Correlations > 0.70 were deemed acceptable and are reported with 95% confidence intervals (CIs).
Results: Twenty-one of 30 residents had assessments collected by both asynchronous and synchronous methods. A total of 699 Web-based (asynchronous) assessments were completed by nurses and 149 by faculty. Synchronous nurse and faculty assessments were obtained in 105 resident–patient encounters. There was no difference in faculty ratings between the MSF collection methods. Nurses assigned slightly (but significantly) higher ratings during synchronous collection. Correlation of the total MSF score between asynchronous and synchronous feedback collection methods within the faculty rater group was poor (0.18, 95% CI = −0.22 to 0.60). Correlation of the total MSF score between asynchronous and synchronous feedback collection methods within the nurse rater group was moderate (0.63, 95% CI = 0.27 to 0.83). Correlations between faculty–nurse rater groups for the total MSF collected asynchronously and synchronously were moderate (0.39, 95% CI = −0.05 to 0.70; and 0.44, 95% CI = 0.01 to 0.73, respectively).
Conclusions: Synchronous collection of MSF did not provide clinically different EM-HS scores within rater groups and did not result in improved correlations. Our small, single-center study supports asynchronous collection of MSF.
The traditional approach to resident assessment relies upon remote memory for performance grading.1 Evaluations completed in this fashion have been shown to reflect a subjective, general impression of clinical performance and professional behavior.2 The time lag between observation of clinical performance and rating has been shown to result in erroneous feedback.3 Details of resident performance may be lost, anchored to a recent event, or inaccurately represented, resulting in a negative or positive rating bias.
Multisource feedback (MSF) is a process for assessing attitudes and behavior in medical and nonmedical professions. MSF is typically based upon remote memory with rater groups completing the evaluation forms at different points in time. This asynchronous collection of feedback is subject to all of the aforementioned biases, as well as numerous cognitive biases.4,5 It is possible that the method of assessment collection along with cognitive biases may be the source of poor inter-rater correlation reported in the medical literature.
Our objective was to determine if there is a difference in feedback based on the method of collection. We hypothesized that MSF from nurses and faculty that is linked to a specific patient interaction and collected immediately (synchronous collection) would yield higher within- and between-rater group correlations than assessments performed in a reflective manner (asynchronous collection).
Methods

This was a prospective assessment of an MSF tool that qualified for a waiver of informed consent from our institutional review board.
Study Setting and Population
The study was conducted at a suburban, university-based emergency department (ED) that employs 100 registered nurses and 28 full-time faculty physicians and sponsors a PGY 1–3 training program in emergency medicine (EM). The residency program is accredited for 10 residents in each year.
Survey Content and Administration
A previously reported instrument assessing EM resident interpersonal skills, attitudes, and behaviors, the Emergency Medicine Humanism Scale (EM-HS), was used to obtain feedback. The EM-HS is an MSF instrument that can be reliably administered to patients, nonphysician staff, and supervising faculty physicians.6 The EM-HS consists of nine questions for health care providers (Figure 1) with ratings on a nine-point continuum from “needs improvement” to “outstanding.” The EM-HS has previously demonstrated excellent generalizability coefficients within nurse and faculty rater groups (Eρ2 = 0.83 and 0.79, respectively). Nurse and faculty evaluators have participated in annual MSF sessions dating back to 2008 and were not provided with additional training on the assessment tool and response scale.
Asynchronous Data Collection. For 3 consecutive months in 2010, the EM-HS was distributed to all full-time EM nurses and all faculty emergency physicians via an electronic survey (http://www.surveymonkey.com, SurveyMonkey, Palo Alto, CA). The survey was open for 21 days each month and programmed for anonymous assessment of an entire residency training class. Evaluations of EM-3 residents were distributed in February, EM-2 residents in March, and EM-1 residents in April. Each week, an e-mail reminder was sent via the ED faculty and ED nurse e-mail lists encouraging participation in the MSF evaluation. Evaluators were informed that the feedback would be included in the resident performance portfolio and reviewed and discussed with each resident.
Synchronous Data Collection. In May 2010, a second round of MSF was conducted in the clinical setting. Research associates were trained to monitor a convenience sample of resident–patient interactions, which served as the substrate for nurse and faculty MSF evaluations. Research associates targeted resident–patient interactions in which medical care was provided, in its entirety, by a single EM resident. At the conclusion of the ED encounter (disposition), research associates approached the nurse and faculty physician to obtain feedback on the specific resident–patient interaction. The nurse and faculty physician providing care for the indexed patient were approached for completion of a paper copy of the EM-HS. Nurses and faculty were instructed to base their ratings on the specific resident–patient interaction and not upon other interactions or impressions. Nurse and faculty raters were blinded to each other’s responses. Raters did not provide identifying information or sign the evaluation forms. Question 1 (“Ability to cooperate with medical colleagues”) was eliminated from the paper-based synchronous survey based upon prior nurse rater feedback. Patient feedback, which is provided to residents as a component of our MSF program, was not included in the data analysis of this study.
Data Analysis. All data were entered into SPSS v.18 (IBM SPSS Inc., Armonk, NY). The average of each question (items 2 through 9) and the total scores provided by each evaluator class were calculated for each resident. Question and total scores are reported as means with standard deviations (SDs). Comparisons of the mean question responses and total evaluation score per rater group were performed using paired-sample t-tests, using 95% confidence intervals (CIs) to determine statistical significance. All analyses were adjusted for clustering using generalized estimating equation methods. Pearson correlations (r) were used to assess agreement within and between nurse and faculty rater group evaluations using resident as the unit of analysis. Correlations > 0.70 (with 95% CI) were deemed acceptable. Data were assessed for normality using skewness and kurtosis measures.
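The analysis was run in SPSS, but the confidence intervals around a Pearson r that it reports are conventionally obtained via the Fisher z-transformation. As a minimal sketch of that computation (function names are ours, not from the study):

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def r_confidence_interval(r, n, z_crit=1.96):
    """95% CI for r from n paired observations via the Fisher z-transformation."""
    z = 0.5 * math.log((1 + r) / (1 - r))   # Fisher z-transform of r
    se = 1 / math.sqrt(n - 3)               # standard error of z
    lo, hi = z - z_crit * se, z + z_crit * se
    return math.tanh(lo), math.tanh(hi)     # back-transform to the r scale
```

For the nurse total-score correlation reported in the Results (r = 0.63 across n = 21 residents), this sketch reproduces the stated interval of roughly 0.27 to 0.83.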
Results

Twenty-one of the 30 residents had MSF assessments completed under both asynchronous and synchronous conditions: eight EM-3 residents, six EM-2 residents, and seven EM-1 residents. There were a total of 699 Web-based (asynchronous) evaluations submitted by nurses and 149 by faculty. The mean numbers of asynchronous nurse and faculty evaluations per resident were 37 (SD ± 15) and 7 (SD ± 2), respectively. Synchronous assessments by both nurse and faculty physician were completed in 105 resident–patient encounters. The mean number of synchronous evaluations per resident was 5 (SD ± 3).
The mean ratings and SDs for each EM-HS item and total score obtained asynchronously and synchronously are listed in Table 1. The mean synchronous ratings provided by faculty were similar to the asynchronous ratings for every item. The mean synchronous ratings, including total score, provided by nurses were slightly (but significantly) higher than the asynchronous ratings.
Table 1. Mean (±SD) EM-HS Ratings by Rater Group and Collection Method

| Item | Faculty Synchronous | Faculty Asynchronous | Faculty Difference (95% CI) | Nurse Synchronous | Nurse Asynchronous | Nurse Difference (95% CI) |
| --- | --- | --- | --- | --- | --- | --- |
| Q2 | 8.0 (0.8) | 7.7 (0.6) | −0.3 (−0.8 to 0.1) | 7.8 (1.1) | 7.1 (0.7) | −0.6 (−0.9 to −0.3) |
| Q3 | 8.0 (0.8) | 7.7 (0.6) | −0.4 (−0.8 to 0.1) | 7.7 (1.0) | 7.1 (0.7) | −0.6 (−0.8 to −0.3) |
| Q4 | 8.0 (0.8) | 7.6 (0.6) | −0.3 (−0.7 to 0.1) | 7.7 (1.0) | 7.1 (0.7) | −0.6 (−0.8 to −0.4) |
| Q5 | 7.9 (0.9) | 7.6 (0.7) | −0.4 (−0.8 to 0.1) | 7.7 (1.1) | 7.0 (0.7) | −0.7 (−1.0 to −0.3) |
| Q6 | 7.8 (0.9) | 7.5 (0.7) | −0.2 (−0.7 to 0.3) | 7.7 (1.2) | 7.0 (0.7) | −0.7 (−1.0 to −0.3) |
| Q7 | 7.8 (0.9) | 7.6 (0.7) | −0.2 (−0.7 to 0.2) | 7.8 (0.9) | 7.0 (0.7) | −0.8 (−1.0 to −0.4) |
| Q8 | 7.9 (0.8) | 7.4 (0.8) | −0.4 (−0.9 to 0.1) | 7.7 (1.0) | 7.0 (0.8) | −0.8 (−1.0 to −0.5) |
| Q9 | 7.9 (0.9) | 7.4 (0.7) | −0.5 (−1.0 to 0.1) | 7.6 (1.4) | 6.9 (0.8) | −0.7 (−1.0 to −0.3) |
| Total score | 63.4 (6.4) | 60.9 (5.2) | −2.4 (−6.0 to 1.0) | 61.8 (8.6) | 56.3 (5.8) | −5.4 (−8.0 to −2.9) |

Difference = asynchronous − synchronous rating; negative values indicate higher synchronous ratings.
Correlations were analyzed on the 21 residents for whom both synchronous and asynchronous evaluations were available (Table 2). Nearly all correlations were less than 0.50, and none achieved the predefined acceptable level of 0.70.
Table 2. Pearson Correlations (95% CI) Within and Between Rater Groups

| Item | Faculty Synchronous With Faculty Asynchronous | Nurse Synchronous With Nurse Asynchronous | Faculty Synchronous With Nurse Synchronous | Faculty Asynchronous With Nurse Asynchronous |
| --- | --- | --- | --- | --- |
| Q2 | 0.24 (−0.21 to 0.61) | 0.45 (0.02 to 0.74) | 0.45 (0.02 to 0.74) | 0.47 (0.05 to 0.75) |
| Q3 | 0.24 (−0.21 to 0.61) | 0.46 (0.04 to 0.74) | 0.41 (−0.03 to 0.72) | 0.41 (−0.03 to 0.72) |
| Q4 | 0.24 (−0.21 to 0.61) | 0.45 (0.02 to 0.74) | 0.31 (−0.14 to 0.65) | 0.32 (−0.13 to 0.66) |
| Q5 | 0.34 (−0.11 to 0.67) | 0.47 (0.05 to 0.75) | 0.38 (−0.06 to 0.70) | 0.35 (−0.10 to 0.68) |
| Q6 | 0.19 (−0.26 to 0.57) | 0.49 (0.07 to 0.76) | 0.48 (0.06 to 0.76) | 0.36 (−0.08 to 0.69) |
| Q7 | 0.27 (−0.18 to 0.63) | 0.43 (−0.002 to 0.73) | 0.57 (0.18 to 0.80) | 0.28 (−0.17 to 0.63) |
| Q8 | 0.20 (−0.25 to 0.58) | 0.44 (0.01 to 0.73) | 0.49 (0.07 to 0.76) | 0.37 (−0.07 to 0.66) |
| Q9 | 0.19 (−0.26 to 0.57) | 0.50 (0.09 to 0.77) | 0.16 (−0.29 to 0.55) | 0.32 (−0.13 to 0.66) |
| Total score | 0.18 (−0.22 to 0.60) | 0.63 (0.27 to 0.83) | 0.44 (0.01 to 0.73) | 0.39 (−0.05 to 0.70) |
Discussion

Our results demonstrate a few interesting findings. Overall, nurse ratings were consistently lower than corresponding faculty assessments. Similar findings have been demonstrated in other MSF studies of resident performance.7,8 Nurses and faculty assigned slightly higher ratings for every EM-HS question when assessments were linked to specific patient interactions and collected in real time. Among nurse raters, the difference between synchronous and asynchronous scores was larger and statistically significant. However, the difference in MSF ratings provided by nurses does not appear large enough clinically to argue in support of real-time assessment collection, a conclusion consistent with other studies on global resident evaluation.
Our data also demonstrated only moderate correlation between faculty and nurse rater groups, reconfirming the utility of MSF. Correlations were poorest between asynchronous and synchronous faculty feedback. There are a few possibilities that may explain our findings.
First, it is possible that synchronous collection of performance evaluations does indeed eliminate the “halo effect.” Preconceptions about prior performance or previous experiences are known to influence judgment about a person or situation.9 The halo effect occurs when a rater does not differentiate among distinct items in an assessment but evaluates according to a global or overall judgment.10 We instructed raters to provide assessments solely on the basis of the indexed patient interaction. However, a study that focused on forewarning and introspection instructions demonstrated no effect on assessment of a scripted interview.11 Gordon12 reported that impression management techniques have little influence on performance ratings. Despite our efforts to encourage ratings based solely on the index patient, it is possible that recent positive or negative experiences affected evaluation responses.
Second, assessments of past performances are subject to forgetting and memory distortion. There is a sequential order of cognitive processes that takes place during performance assessments; behaviors are recognized, organized, and integrated with previous observations and placed into storage for retrieval at a later time.13 Heneman3 demonstrated that performance assessments were less accurate when ratings were reported 1 to 3 weeks after observation and when a small amount of information was observed. A study on the effects of maintaining a performance assessment diary demonstrated that raters who kept a diary had better performance recall than raters who did not, even when they were not allowed to refer to the diary.14 It is very likely that time plays a critical role in the complex cognitive processes required to complete performance ratings.13
Third, contextual noise prohibits accurate assessment of resident performance. Clinical raters typically have a multitude of responsibilities, resulting in a fragmentary view of an unsystematic sample of clinical situations with limited observation of resident–patient interaction.2 Obviously, it is difficult to determine which of the feedback results (synchronous or asynchronous) represents the true behavior that was displayed. Validity studies of MSF tools are inherently difficult. A study to determine such validity would require other raters to provide simultaneous assessments of the resident–patient interaction.
Fourth, the items may be too broad and nonspecific for use in specific clinical situations. Schwartz15 reported that question wording, context, and format can have a dramatic effect on collected data. Kaiser and Craig16 demonstrated that syntax, “multi-barreledness,” and the degree of abstraction in survey items may cause raters to attach different meanings, resulting in different interpretations and discrepant ratings of the same target. Although the EM-HS is based on a feedback tool that had been previously trialed and validated,17 it is possible that the phrasing carries different meaning for our group of raters.
Anecdotally, it is far more challenging to collect evaluations from nurses and faculty in real time. Our research associates targeted 255 resident–patient interactions during the study period but succeeded in acquiring completed evaluations from both the nurse and the faculty physician in fewer than half. Time pressures and distractions decrease the amount of observation time and the amount of clinical performance sampling in the competencies of interest.2 In our study, raters reported that patient care responsibilities frequently interfered with the time necessary to complete an MSF evaluation. Furthermore, raters did not have the opportunity to observe every item on the EM-HS for every resident–patient encounter, resulting in incomplete assessments. Nurses and faculty reported that it was far easier to complete evaluations online, during nonclinical time. Our data suggest that MSF results are essentially the same regardless of the collection method (synchronous or asynchronous). Given the resources required for collection of real-time feedback, we suggest that traditional asynchronous collection of MSF yields equivalent results and is easier to accomplish.
Limitations

This was a single-center study of EM residents, conducted in a suburban tertiary care facility; application to other settings or specialty programs may therefore be constrained. We did not train nurses and faculty on the evaluation form or rating scale prior to implementation. Experts on MSF suggest that raters and subjects should understand the purpose and use of MSF for a program to succeed.18 Rater training is important for obtaining accurate feedback and for educating raters about forms of bias such as the halo effect and leniency. However, our program has used the EM-HS for MSF of resident professionalism since 2008, and the majority of faculty and nursing staff were familiar with the instrument.
Our prior study on MSF of EM residents demonstrated that 11 faculty and 22 nurse evaluations are required to provide stable generalizable estimates for the EM-HS. Although it was our intent to obtain a minimum of 10 evaluations per resident, we had difficulty coordinating and collecting real-time assessments from both nurses and faculty. A total of 255 resident–patient interactions were targeted for synchronous feedback collection; however, our research associates were able to obtain completed surveys from both nurse and faculty in only 105. In our experience, a full-time research associate program is necessary to collect the requisite number of evaluations from patients to provide stable generalizable estimates.6 Synchronizing collection of nurse and faculty feedback will increase the time and resources necessary to collect an appropriate sample.
Response range bias (the tendency to use a narrow range of responses) was common for both synchronous and asynchronous collection methods. Further analysis of our data demonstrated that faculty provided the same rating for each question in 49.5% of synchronous assessments and 51.3% of asynchronous assessments. Nurses provided the same rating for each question in 48.6% of synchronous assessments and 56.5% of asynchronous assessments. This finding may reflect rater indecision, indifference, or lack of interest. It may also suggest validity issues with the questionnaire or rating scale. Further investigations are required to elucidate this finding.
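The tallies above amount to flagging "flat-line" responses, i.e., assessments in which every item received the identical rating. A minimal sketch of such a check, with hypothetical function names not taken from the study:

```python
def uses_single_rating(ratings):
    """True when the rater assigned the identical score to every item."""
    return len(set(ratings)) == 1

def flat_response_proportion(assessments):
    """Proportion of assessments in which every item got the same rating.

    Each assessment is a list of per-item ratings (e.g., items 2-9 of the EM-HS).
    """
    flat = sum(uses_single_rating(a) for a in assessments)
    return flat / len(assessments)
```

Applied to the per-rater item lists, such a proportion near 0.5, as observed here for both groups and both methods, would flag the response range bias the authors describe.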
Conclusions

Synchronous collection of multisource feedback does not provide meaningfully different ratings than assessments collected asynchronously. Inter-rater correlations are not improved by a synchronous assessment method.
The authors acknowledge Anna Domingo and the Stony Brook Academic Associate Program for their hard work in collecting data vital to the project.
- 6. Feasibility and reliability of multisource feedback in emergency medicine residents. J Grad Med Educ. 2011;3(3):356–60.
- 16. Building a better mousetrap: item characteristics associated with rating discrepancies in 360-degree feedback. Consult Psychol J Pract Res. 2005;57:235–45.