Evaluating the Performance of Inpatient Attending Physicians

A New Instrument for Today's Teaching Hospitals


Received from the Department of Medicine (CAS, ABV, ATE, BMR), Cook County Hospital and Rush Medical College, Chicago, Ill.

Address correspondence and requests for reprints to Dr. Smith: Department of Medicine, Cook County Hospital, Room 1518, 1900 W. Polk Street, Chicago, IL 60612 (e-mail: christopher_smith@rush.edu).


OBJECTIVE:  Instruments available to evaluate attending physicians fail to address their diverse roles and responsibilities in current inpatient practice. We developed a new instrument to evaluate attending physicians on medical inpatient services and tested its reliability and validity.

DESIGN:  Analysis of 731 evaluations of 99 attending physicians over a 1-year period.

SETTING:  Internal medicine residency program at a university-affiliated public teaching hospital.

PARTICIPANTS:  All medical residents (N = 145) and internal medicine attending physicians (N = 99) on inpatient ward rotations during the study period.

MEASUREMENTS:  A 32-item questionnaire assessed attending physician performance in 9 domains: evidence-based medicine, bedside teaching, clinical reasoning, patient-based teaching, teaching sessions, patient care, rounding, professionalism, and feedback. A summary score was calculated by averaging scores on all items.

RESULTS:  Eighty-five percent of eligible evaluations were completed and analyzed. Internal consistency among items in the summary score was 0.95 (Cronbach's α). Interrater reliability, using an average of 8 evaluations, was 0.87. The instrument discriminated among attending physicians with statistically significant differences on mean summary score and all 9 domain-specific mean scores (all comparisons, P < .001). The summary score predicted winners of faculty teaching awards (odds ratio [OR], 17; 95% confidence interval [CI], 8 to 36) and was strongly correlated with residents’ desire to work with the attending again (r = .79; 95% CI, 0.74 to 0.83). The single item that best predicted the summary score was how frequently the physician made explicit his or her clinical reasoning in making medical decisions (r² = .90).

CONCLUSION:  The new instrument provides a reliable and valid method to evaluate the performance of inpatient teaching attending physicians.

Profound changes in the practice of medicine and delivery of health care have affected the roles and responsibilities of all physicians, including trainees and their teachers.1 At most academic hospitals today, attending physicians’ responsibilities extend far beyond the teaching of clinical medicine to residents and students. They must also manage patients’ day-to-day care efficiently (sicker and quicker); incorporate the best available evidence into their clinical decisions2; exemplify professionalism3; and demonstrate mastery of bedside skills4 as well as new technology and treatments. Instruments currently used to evaluate the quality of attending physicians in teaching hospitals do not address these diverse responsibilities. Many measure didactic and professional behaviors more relevant to traditional classroom teaching than modern clinical attending.5–9 Others have not been well validated10 or provide only a single summary score to rank faculty performance,11 which does not allow specific feedback to foster faculty improvement.12–14 For these reasons, no current instrument is valid and valuable, at least at our teaching hospital. Therefore, we developed a new instrument to evaluate inpatient attending physicians and tested its validity and reliability.


We developed the instrument (available at http://www.jgim.org) with the goals of accurately representing the full range of roles and responsibilities of inpatient attending physicians and collecting information that was sufficiently specific to allow targeted efforts at faculty improvement. This project was judged to be exempt from Institutional Review Board approval at our facility.


Our setting is a university-affiliated public hospital with 145 internal medicine residents and 130 full-time internal medicine clinical faculty. Twelve admitting teams staff the hospital's internal medicine service. Teams consist of 4 residents and 1 attending physician who serves as both the teaching attending and the patients’ physician of record. Each team admits approximately 100 patients during a 4-week period. The internal medicine service accepts all medical admissions to the hospital except patients admitted to the coronary care unit, the medical intensive care unit, or the subspecialty HIV service. Most faculty serve as attending physicians on the medicine inpatient service for two 4-week periods annually, supervising a different group of 4 residents each period. We evaluated the instrument during the academic year from July 2000 to June 2001.

Instrument Development

A 7-member faculty committee, including the department chair, residency program director, and associate program directors, reviewed the literature on existing instruments,5–11 clarified the important attending roles, responsibilities, and expectations within our hospital, and developed questions to evaluate attending physicians’ performance. Because our attending physicians simultaneously serve as the patients’ primary inpatient physician and the residents’ primary teacher, we included items that assessed, a priori, 4 main areas of interest to us: clinical supervision, teaching, feedback, and professionalism. The instrument was revised after review by other faculty and after pretesting with a sample of residents. The final instrument included 32 items, with most items using 5-point Likert response options (Appendix).

Data Collection

Each resident independently completed an anonymous, scannable questionnaire on the last day of each 4-week rotation during a mandatory 30-minute feedback session. The questionnaire took about 10 minutes to complete, and during the meeting investigators collected the instruments and conducted group interviews with each team of residents to assess faculty performance and resident experiences that might not have been captured by the questionnaire. Thus, these mandatory sessions offered an opportunity to assess the instrument's clarity and completeness. To maximize candor, we assured resident anonymity by transcribing all handwritten comments and by only presenting attending physicians with their own aggregated results at the end of the year. The confidentiality of faculty was protected by using unique numerical identifiers during data analysis and by not sharing results until their validity and reliability were verified.

Data Analysis

Our analytic goals were to group items into meaningful conceptual domains and then test the instrument's reliability and validity.

Conceptual Domains.  After data collection was completed, we evaluated whether there were distinct conceptual domains of performance within the 4 broad areas of clinical supervision, teaching, feedback, and professionalism. The criteria for defining a conceptual domain were that questions had to be conceptually similar as well as statistically related (inter-item correlations greater than 0.4). The conceptual criteria required consensus among the 7 members of the instrument committee. Items that might have been conceptually similar but did not meet our statistical threshold, such as teaching sessions and teaching pertinent to patients, were retained as single-item domains. Two items were dropped because the residents considered them ambiguous. This process allowed us to group 32 items into 9 distinct conceptual domains (Table 1). A summary score was calculated as a mean of all equally weighted items in the 9 domains.

Table 1. Domains Representing the Roles and Responsibilities of Inpatient Attending Physicians

1. Evidence-based medicine
 Broad knowledge of medicine (q1)
 Up to date (q2)
 Expected questions and literature searches (q17)
 Modeled questioning and searching (q18)
 Expected incorporating best evidence (q21)
 Modeled incorporating best evidence (q22)
2. Bedside teaching
 Taught interviewing and communication (q14)
 Taught physical exam skills (q15)
3. Clinical reasoning
 Expected differential diagnosis and plan (q7)
 Expected decisions (q10)
 Explicit about clinical reasoning (q16)
 Expected a working diagnosis (q19)
 Modeled committing to working diagnosis (q20)
4. Patient-based teaching
 Taught about patients’ problems (q13)
5. Teaching sessions
 Amount of teaching (q24)*
6. Patient care
 Saw patients every day (q3)
 Independently evaluated each patient (q4)
 Reviewed daily care plan with team (q5)
 Contributed information and advice (q6)
 Helped speak with consultants (q8)
7. Rounding
 Effective and efficient postcall rounds (q9)
8. Professionalism
 Treated patients with respect (q27)
 Treated residents with respect (q28)
 Released residents for conferences (q29)
 Encouraged residents to call at any time (q30)
 Sensitive to social and cultural factors (q31)
9. Feedback
 Provided ongoing feedback (q23)
 Stated expectations at beginning (q25)
 Provided midrotation feedback (q26)
Summary score
 Mean score for all items

* Scores were adjusted for attending physicians’ contact with residents (2 or 4 weeks) and transformed to a 5-point scale.
Dichotomous responses were transformed to either 1 (no) or 5 (yes).
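The scoring rule described above (an unweighted mean of all items, with dichotomous answers mapped to 1 or 5) can be sketched as follows. This is a minimal illustration, not the authors' scoring program: the function names are hypothetical, the choice of which items are treated as yes/no in the example is invented for demonstration, and the contact-time adjustment applied to q24 is not modeled.

```python
from statistics import mean

def to_five_point(response):
    """Put one answer on the 5-point scale.

    Dichotomous answers map to the ends of the scale, as in the Table 1 note:
    "no" -> 1 and "yes" -> 5. Numeric Likert answers pass through unchanged.
    """
    if isinstance(response, str):
        answer = response.strip().lower()
        if answer == "yes":
            return 5.0
        if answer == "no":
            return 1.0
        raise ValueError(f"unrecognized dichotomous answer: {response!r}")
    return float(response)

def summary_score(item_responses):
    """Unweighted mean of all item scores on a single evaluation."""
    return mean(to_five_point(r) for r in item_responses.values())

# Hypothetical evaluation; treating q3 and q4 as yes/no items is an assumption.
evaluation = {"q1": 4, "q2": 5, "q3": "yes", "q4": "no"}
score = summary_score(evaluation)
```

Mapping yes/no answers to 1 and 5 keeps every item on the same range, so no item dominates the mean simply because of its response format.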

Reliability.  We measured two types of reliability: the consistency among responses to items within each domain (Cronbach's coefficient α) and the consistency among different residents when they evaluated the same attending physician on the same domain (inter-rater reliability). The inter-rater reliability was analyzed in two ways. First, the inter-rater reliability coefficient was calculated to describe the proportion of the variation among physicians’ mean scores that can be attributed to true physician differences. Second, we estimated the precision (twice the standard error, equivalent to a 95% confidence interval [CI]) of a physician's mean domain score under three scenarios—with 1, 4, or 8 resident evaluations. This second measure of inter-rater reliability describes the range of uncertainty around any score and is measured in the units of the original 5-point scale. The two measures of inter-rater reliability provide two complementary perspectives.
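For readers unfamiliar with the first of these measures, Cronbach's coefficient α for a multi-item domain can be computed from the item variances and the variance of the per-evaluation totals. The sketch below uses the standard textbook formula; the function name and data layout are assumptions for illustration, and the study itself used SPSS rather than this code.

```python
from statistics import variance

def cronbach_alpha(item_scores):
    """Cronbach's alpha for one domain.

    item_scores holds one list per item; each list contains that item's score
    on every evaluation, in the same evaluation order across items.
    """
    k = len(item_scores)
    if k < 2:
        # A single-item domain has no inter-item consistency to measure.
        raise ValueError("alpha needs at least 2 items")
    n = len(item_scores[0])
    if n < 2 or any(len(col) != n for col in item_scores):
        raise ValueError("every item needs the same number (>= 2) of evaluations")
    # Total domain score for each evaluation.
    totals = [sum(scores) for scores in zip(*item_scores)]
    item_variance_sum = sum(variance(col) for col in item_scores)
    # Raises ZeroDivisionError if all totals are identical (no variance).
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))

# Two hypothetical items answered on four evaluations.
alpha = cronbach_alpha([[1, 2, 3, 4], [2, 1, 4, 3]])
```

The guard on the number of items mirrors the point that α is undefined for single-item domains.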

Validity.  We assessed the instrument's face, content, and construct validity, but the accuracy, or criterion validity, of our instrument could not be assessed, because there is no gold standard. An instrument has face and content validity if its purpose and instructions are clear, the items are specific and unambiguous, the format and length acceptable, the gradations in response options appropriate, and if the instrument is comprehensive (includes all important variables and excludes unimportant ones). These characteristics were evaluated in two ways. First, we solicited feedback from residents through free-text responses at the end of each questionnaire and during the monthly, mandatory, end-of-rotation feedback sessions. Second, at the end of the year, we surveyed attending physicians after they had received their individual aggregate data to determine whether the results were believable and potentially useful.

To evaluate the instrument's construct validity, we conducted 4 tests to verify that it performed as expected. First, we tested whether the instrument could detect important and statistically significant differences between attending physicians on each domain. Second, we tested whether each domain contributed unique information and could thus discriminate among attending physicians even after adjusting for all other domain scores. Third, we tested whether the instrument's summary score could predict which physicians had won teaching awards over the past 4 years (determined by a vote among residents without input from the formal evaluation process). Fourth, we tested whether the instrument's summary score was correlated with a related variable, how strongly the resident wanted to work with the attending physician again (measured at the end of each questionnaire with a single question and 5 response options, from definitely no to emphatically yes).

Statistical Analysis

Analyses were performed with SPSS statistical software, version 10 (SPSS Inc., Chicago, Ill). Interrater reliability coefficients were calculated using analysis of variance with resident considered a random factor.15 Calculating the standard error for means of 1, 4, and 8 measurements assessed precision of the domain scores. Tests of construct validity included analysis of variance, multivariable linear regression, logistic regression, and Spearman correlation. Model assumptions were verified for all analyses. All P values are two-sided.
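To make the ANOVA-based reliability coefficient concrete, the following sketch computes a one-way random-effects intraclass correlation for a balanced design. It is a simplification under stated assumptions: every attending is rated by the same number of residents (the study had 3 to 16 evaluations per attending), the function name is hypothetical, and the authors' exact ANOVA model in SPSS may differ.

```python
from statistics import mean

def one_way_icc(ratings):
    """One-way random-effects ICC for a balanced design.

    ratings holds one list per attending, each with k resident scores.
    Returns (reliability of a single rating, reliability of the mean of k).
    """
    n = len(ratings)
    k = len(ratings[0]) if n else 0
    if n < 2 or k < 2 or any(len(r) != k for r in ratings):
        raise ValueError("need >= 2 attendings, each with the same k >= 2 ratings")
    grand_mean = mean(x for r in ratings for x in r)
    group_means = [mean(r) for r in ratings]
    # Between-attending and within-attending mean squares from one-way ANOVA.
    ms_between = k * sum((m - grand_mean) ** 2 for m in group_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for r, m in zip(ratings, group_means) for x in r
    ) / (n * (k - 1))
    # Undefined (ZeroDivisionError) if all attendings have identical means.
    single = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
    average = (ms_between - ms_within) / ms_between
    return single, average

# Two hypothetical attendings, each rated by two residents.
single, average = one_way_icc([[1, 3], [3, 5]])
```

The first value describes how much a single evaluation can be trusted; the second describes the mean of k evaluations, which is the quantity that matters when results are aggregated per attending.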


Residents completed 731 evaluations of the 99 faculty who served as inpatient attending physicians during the 12-month data collection period (85% of all eligible evaluations), for an average of 7.4 evaluations per attending (range, 3 to 16).

Conceptual Domains

Table 1 describes the items for the 9 conceptual domains.


As shown in Table 2, all multi-item domains, except feedback, demonstrated acceptable consistency among items (Cronbach's α > 0.8). For interrater reliability, an evaluation by a single resident provided an unreliable assessment of an attending physician's performance. However, if 8 residents were averaged (the number of residents an attending physician would supervise during two 4-week rotations at our hospital), the interrater reliability coefficients were high (>0.8) for 7 of the 9 domains (Table 2). For 2 domains—feedback and professionalism—the values for the reliability coefficient were relatively low (0.62 and 0.77). However, when interrater reliability was assessed using the CI (Table 3), professionalism was measured with good precision with 8 resident evaluations (95% CI ± 0.1), but feedback was measured with the least precision (CI ± 0.4).

Table 2. Reliability of the Domain Scores and Summary Score

Domain                     Internal Consistency   Interrater Reliability:   Interrater Reliability:
                           (Cronbach's α)         1 Evaluation              Average of 8 Evaluations
Evidence-based medicine    0.93                   0.46                      0.87
Bedside teaching           0.90                   0.35                      0.81
Clinical reasoning         0.88                   0.36                      0.82
Patient-based teaching     *                      0.36                      0.82
Teaching sessions          *                      0.32                      0.79
Patient care               0.86                   0.43                      0.86
Summary score              0.95                   0.45                      0.87

* A Cronbach's α cannot be calculated for single-item domains.

Table 3. Precision of Domain Scores for 1, 4, or 8 Evaluations per Attending Physician*

Precision: ±2 × SE (equivalent to 95% CI)

Domain                     1 Evaluation   4 Evaluations   8 Evaluations
Evidence-based medicine    ±0.6           ±0.3            ±0.2
Bedside teaching           ±0.9           ±0.4            ±0.3
Clinical reasoning         ±0.5           ±0.2            ±0.2
Patient-based teaching     ±0.6           ±0.3            ±0.1
Teaching sessions          ±1.0           ±0.5            ±0.3
Patient care               ±0.5           ±0.2            ±0.2
Summary score              ±0.4           ±0.2            ±0.1

* All scores are on a scale from 1 to 5.
SE = standard error of the mean.
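The gain from averaging several evaluations can be illustrated with two standard relationships: the Spearman-Brown formula for the reliability of a mean of n ratings, and the 1/√n shrinkage of the standard error. The paper does not state that these formulas were used (the coefficients were derived from analysis of variance), so the sketch below is only a consistency check on the published single-evaluation figures, with hypothetical function names.

```python
from math import sqrt

def spearman_brown(single_rating_reliability, n_ratings):
    """Projected reliability of the mean of n_ratings independent ratings."""
    r = single_rating_reliability
    return n_ratings * r / (1 + (n_ratings - 1) * r)

def half_width_for_mean(single_half_width, n_ratings):
    """Half-width of the +/- 2 SE interval for a mean of n_ratings ratings."""
    return single_half_width / sqrt(n_ratings)

# Single-evaluation interrater reliabilities as published in Table 2.
single_evaluation = {
    "Evidence-based medicine": 0.46,
    "Bedside teaching": 0.35,
    "Clinical reasoning": 0.36,
    "Patient-based teaching": 0.36,
    "Teaching sessions": 0.32,
    "Patient care": 0.43,
    "Summary score": 0.45,
}

# Projection to the mean of 8 evaluations, rounded as in the table;
# for example, 0.46 for one evaluation projects to about 0.87 for eight.
projected = {
    domain: round(spearman_brown(r, 8), 2)
    for domain, r in single_evaluation.items()
}
```

Under these assumptions the projected values agree with the 8-evaluation column of Table 2. The √n scaling matches Table 3 only approximately, because the published half-widths are rounded to one decimal place.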


Face and Content Validity.  Residents confirmed that the instrument was acceptable, sensible, and comprehensive during the 13 feedback sessions. Attending physicians verified that the results were believable and potentially useful after they had reviewed a summary of their personal scores at the end of the year.

Construct Validity.  The instrument is able to discriminate among physicians, as there were statistically significant differences among attending physicians within each domain (ANOVA, P < .001 for each domain). The greatest variation among physicians was in bedside clinical teaching (mean scores ranging from 1.1 to 5) and in teaching sessions (mean scores ranging from 1 to 5), where individual physicians’ mean scores spanned the full range of potential values (Fig. 1). The least variation was in professionalism (mean scores ranged from 3.8 to 5). On average, our attending physicians were rated highest on professionalism and lowest on teaching sessions (Fig. 1).

Figure 1.

Boxplots of the 99 attending physicians’ average scores by domain. The gray box represents the distribution of the middle half of physicians’ average scores; the vertical black line transecting the box represents the median value; the whiskers extend to the minimum average score and maximum average score.

Each domain contributed unique information about attending physicians, because there were statistically significant differences among physicians within each domain, even after accounting for their performance in all other domains using multivariable linear regression (for each domain, P < .005). Thus, the instrument can discriminate among the physicians’ different roles and responsibilities represented by the 9 domains. The instrument's summary score was a strong predictor of those physicians who had recently won teaching awards; for every increase of 1 point in the summary score, the odds of having won a teaching award increased 17-fold (95% CI, 8 to 36).
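The 17-fold figure can be read through the usual logistic regression relationship between a coefficient and an odds ratio. The worked arithmetic below assumes a model in which the summary score enters as a single linear predictor with coefficient \beta; the half-point example is an illustrative extrapolation and is not reported in the study.

```latex
\begin{align*}
\mathrm{OR}_{\Delta = 1} &= e^{\beta} = 17
  \quad\Longrightarrow\quad \beta = \ln 17 \approx 2.83 \\
\text{95\% CI for } \beta &\approx (\ln 8,\ \ln 36) \approx (2.08,\ 3.58) \\
\mathrm{OR}_{\Delta = 0.5} &= e^{0.5\,\beta} = \sqrt{17} \approx 4.1
\end{align*}
```

On the log scale the interval is nearly symmetric about 2.83 (about 0.75 on either side), as expected for a confidence interval constructed on the coefficient and then exponentiated.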

Finally, there was a strong correlation between the instrument's summary score and the resident's desire to work with the attending physician again (r = .79; 95% CI, 0.74 to 0.83).

Summary Score and Individual Items

Among the individual items, the strongest predictor of an attending physician's summary score was how frequently attending physicians explicitly described their clinical reasoning when discussing clinical decisions (r² = .90). This item on explicit clinical reasoning was also the best predictor of a resident wanting to work with the attending physician again (r² = .79).


These findings strongly support the validity and reliability of our new evaluation instrument. Unlike others, this instrument evaluates attending physicians’ educational and supervisory roles when caring for medical inpatients, measuring performance in several separate domains relevant to each of those responsibilities. Such comprehensive, detailed evaluations are needed now, given current expectations that teaching hospitals set new standards for evidence-based care, patient safety, and the training of our nation's future physicians.

Assessment of the instrument's validity yielded impressive results. Each of its 9 conceptual domains discriminated among attending physicians—evaluation of different physicians varied significantly within each domain—and each domain contributed specific, unique information. Its construct validity was supported by its powerful prediction of physicians who won teaching awards and residents’ desire to work again with individual physicians. Although there is no gold standard to assess its criterion validity (accuracy), both residents and attending physicians attested to the instrument's face and content validity.

The instrument's reliability was also impressive. All domains except one (feedback) demonstrated good interrater reliability—either reliability coefficient >0.8 or CI <±0.4. These findings highlight the importance of three issues. First, it is important to obtain multiple independent evaluations before judging an attending physician's performance. Second, measuring interrater reliability in more than one way provides a more comprehensive and clearer perspective of the results (for example, professionalism). Third, they demonstrate the need to improve reliability in evaluating feedback, a domain also rated lower in several previous studies.6,7,10,16 At least in part, this problem may be perceptual: some learners seem not to “recognize” feedback even when it has been well-documented.17

After analyzing our data we recognized the limitations of some of the items and revised them. The reliability for assessing feedback was unacceptably low and there was no evaluation of end-of-rotation feedback. Our revised questionnaire now assesses whether attending physicians are providing continuous feedback and we plan to test the reliability of these revisions. End-of-rotation feedback is important and captured in another instrument when the attending physician completes the written resident evaluation and discusses it with the trainee. Other revisions to the instrument have included an assessment of the quality, as well as quantity, of the teaching sessions. We also plan to improve and test the specificity of the bedside teaching items to ensure that these measures actually assess teaching time at the bedside.

It will be important to test whether our findings are generalizable to other institutions where attendings teach and supervise inpatient care simultaneously. If so, use of the instrument would allow credible evaluation of clinical faculty in all such hospitals, help to standardize criteria for academic promotion of clinician-educators, and facilitate multicenter trials to improve attending performance. Table 3, for example, provides precise estimates of the upper and lower confidence limits for individual faculty members’ scores in each domain. Such data can provide the basis for attending-specific or department-wide improvement efforts directed at any or all of the 9 domains. These metrics also can be used to study many unresolved questions about modern inpatient teaching services: are hospitalists the answer?18,19 Do generalists outperform specialists?20 Can bedside teaching be resurrected?21,22 Can the inpatient practice of “evidence-based medicine” be promoted and improved?23–25

We note with interest domains whose ratings exhibited marked variation among attending physicians (Fig. 1). For example, evaluation of teaching sessions and bedside teaching had wide interquartile ranges, with individual physician scores spanning the full range of potential values. Although previous research illuminates some of the characteristics and practices of successful clinical teachers17,26–28 and role models,29–31 little is known about how some physicians manage to teach more than others when working on a busy inpatient service. We hope to learn more about this from our own attending physicians who received the highest ratings in these domains.

We note also the pivotal importance of the query about explicit clinical reasoning. Intended to evaluate the attending physicians’ conscious effort to make transparent to trainees the process of clinical problem solving, this one item was the dominant predictor of both the instrument's summary score and residents’ desire to work with the attending again. It may not be surprising that learners want teachers to “think out loud,” but we are unaware of previously published data about the importance of this issue to physicians-in-training. More research is needed to understand better its meaning and implications. Although we found this single item to be highly predictive of the summary score, this does not obviate the need for the other items in the questionnaire. For an instrument to provide useful feedback to academic ward attending physicians, one still requires specific questions that address each of their major roles and responsibilities.

Finally, it is important to study whether feedback of evaluations to attending physicians improves their performance, overall or in specific domains, individually and department-wide. If simple feedback were unproductive, targeted interventions in faculty development would be appropriate next steps. In fact, this same sequential strategy for evaluation and improvement makes sense in other clinical teaching venues: intensive care units, subspecialty consultation services, emergency departments, and outpatient clinics. For use in these venues, our evaluation instrument may need some site- and discipline-specific modifications. Nevertheless, in its current form, the instrument provides a valid, reliable tool to begin to understand better the complex, critical responsibilities of attending faculty in teaching hospitals today.


The authors thank Drs. Maurice Lemon, Peter Clarke, Krishna Das, and Avery Hart for their valuable contributions to the design of the instrument and the medicine residents and attending physicians of Cook County Hospital who participated in the study.

Financial support came from the Department of Medicine, Cook County Hospital.

