- Top of page
- LITERATURE REVIEW
Student Evaluations of Instruction (SEIs) from about 6,000 sections over 4 years, representing over 100,000 students at the college of business at a large public university, are analyzed to study the impact of noninstructional factors on student ratings. Administrative factors like semester, time of day, and location, and instructor attributes like gender and rank, are studied. The combined impact of all the noninstructional factors studied is statistically significant. Our study has practical implications for administrators who use SEIs to evaluate faculty performance. SEI scores reflect some inherent biases due to noninstructional factors. Appropriate norming procedures can compensate for such biases, ensuring fair evaluations.
Student Evaluations of Instruction (SEIs) are now commonplace among universities as a key mechanism for getting feedback regarding teaching practices. According to Seldin (1993), 86% of U.S. colleges and universities use SEIs to make key decisions about faculty. These SEIs also form a key component of evaluations of faculty teaching performance by the administration, and impact promotion and tenure decisions. As such, there is always a debate about the validity and appropriate use of these instruments. Brightman (2005) has argued that to be useful, an instrument must first be valid, and norming procedures must be in place to aid comparative interpretation of the data. Norming requires the identification of systematic biases in the ratings of overall instructor effectiveness (OIE) due to noninstructional factors.
A clear understanding of the impact of nonteaching-related factors is necessary to ensure fair evaluation of faculty. For example, if a factor like class size significantly affects overall ratings on an SEI for an instructor, then there should be a norming process used by administrators which compensates for class size differences when evaluating faculty. Researchers have examined the impact of various factors on SEI results to look for systematic biases in various fields, from psychology (Greenwald, 1997) to economics (Isley & Singh, 2005) and business (Isley & Singh, 2007; Liaw & Goh, 2003; Peterson, Berenson, Misra, & Radosevich, 2008). The nonteaching-related factors can be classified as student related, instructor related, course related, and administrative or situational (Peterson et al., 2008; Pounder, 2007). Student-related factors include the initial motivation of the student for the subject, grade expectation, grade point average, and gender. Instructor-related factors include the instructor's rank and gender, whereas course characteristics include type of course (qualitative vs. quantitative, core vs. noncore) and course level (graduate vs. undergraduate). Administrative factors influencing SEI ratings include class size, location, classroom and equipment, and time of day.
Some researchers believe that student grade expectations are positively correlated with SEI ratings (Zangenehzadeh, 1988), whereas others argue the opposite (Marsh & Roche, 2000). Centra (2003) analyzed more than 50,000 college courses controlling for class size, teaching method, and student perceived learning outcomes in the course. Learning outcomes turned out to have a large positive effect on SEIs. After controlling for learning outcomes, expected grades did not affect student evaluations.
Studies on teaching innovations demonstrate that a good innovation leads to improved student motivation and engagement, resulting in better student performance (Bergquist & Maggs, 2011; Snider & Eliasson, 2013). Better student performance is in turn positively correlated with higher instructor effectiveness ratings (Davis, 2009). It is therefore plausible that improved teaching results in an increase in grade expectations as well as better student evaluation of teaching effectiveness.
The focus of this article is on the impact of noninstructional factors on student evaluations. We therefore exclude grade expectation from the study, since it is too intertwined with teaching ability to be considered a noninstructional factor.
While many researchers have examined the impact of nonteaching-related factors on instructor ratings in different disciplines, there is a need to conduct integrative studies to look for consistent patterns across universities and disciplines, or to examine the differences as they appear. The noninstructional factors, especially administrative ones, are likely to be different in each institution, and a fair evaluation requires examination of the data at various institutions. This study focuses on SEIs from the College of Business at a large research university, spanning 4 years and 10 different departments.
We examine the following key research question:
Do the noninstructional factors (such as course type and level, instructor rank and gender, semester, time of day) have a significant effect on the OIE ratings?
If these factors are significant, and if the impact is large enough, they should be used for norming purposes when comparing faculty performances. The rest of the article is organized into the following sections: literature review, methodology, discussion of results, and reflections.
- LITERATURE REVIEW
There is a debate in the literature about the validity of using SEIs for assessment of teaching. As some researchers argue, the goal of teaching is to improve student learning. Therefore, the learning must be measured, not the intervention. However, according to recent surveys of research on SEIs, most variables that correlate with student ratings of instruction are also related to instructional effectiveness and student learning (Benton & Cashin, 2012). Benton, Douchon, and Pallett (2013) found self-ratings of student learning to be positively correlated with student performance. Students who rate instructors higher also perform better on exams, and are better able to apply course material and show greater interest in pursuing the subject in later years (Davis, 2009).
One question goes beyond the validity of the instrument to ask if there are systematic biases due to factors that are extraneous to the student evaluation instrument. Scriven (2011) argues that an evaluation instrument must be credible as well as valid, with credibility referring to the audience's estimate of the validity. He states
“… evaluation design must sometimes involve considerations that go beyond validity. This must not be viewed as pandering to prejudice, but as of the essence of certification, of accountability, in a more general sense of the educational and social obligations of the evaluations. (“It is not enough that justice be done, it must also be the case that it must be seen that justice is done.”).”
In the context of higher education, norming of teaching effectiveness scores obtained from SEIs is the way to ensure that justice is done (and seen to be done) in evaluating faculty. If there are factors that bias the teaching effectiveness scores, then such biases must be compensated for. The factors causing such biases can be broadly categorized as course related, instructor related, and administrative (Feldman, 2007; Peterson et al., 2008; Pounder, 2007).
Davies, Hirshberg, Lye, Johnson, and McDonald (2007) studied the impact of several noninstructional factors on instructor ratings in a study of undergraduates in Australia. They found course-related factors such as the quantitative nature of a subject to have a significant effect. Costin, Greenough, and Menges (1971) studied ratings by class designation and found instructors receiving higher ratings from seniors than from freshmen. It could be because better instructors are selected to teach higher level classes, indicating a selection bias of sorts. It could also be because the poorer students drop out in the first couple of years, and better students make it to the senior year, which also affects instructor ratings.
Peterson et al. (2008) find the senior-level students giving better ratings than sophomores and also better ratings than students taking graduate courses. Given that the 400- or senior-level courses are (a) in the discipline concentration, (b) student-selected electives, or (c) the required business capstone, one possible explanation for their significantly better student evaluations is what might be termed a “familiarity effect.” Students become more familiar with the professors from whom they have taken earlier classes and therefore have reduced anxiety.
Student ability and initial liking for the subject have an impact on instructor ratings (Aigner & Thum, 1986). Courses aimed at students of high ability get higher ratings, and those aimed at students with low ability get lower ratings. Some of that may translate to noncore classes getting higher ratings, since those courses are selected by students that presumably believe that they have some ability in that subject. Feldman (2007) found that students in major courses rated instructors higher than students in nonmajor courses. Also, students in elective courses rated instructors higher than those in required courses. Expecting ratings for graduate courses to be higher than undergraduate, and noncore higher than core, Brightman, Elliott, and Bhada (1993) used four categories—undergraduate core (UC), undergraduate noncore (UN), graduate core (GC), and graduate noncore (GN)—based on course level (undergraduate, graduate) and course type (core, noncore) to norm SEI data.
Gender differences in performance evaluations in various fields have been studied extensively in the literature (Arvey, 1979; Dobbins, Cardy, & Truxillo, 1988; Mobley, 1982). Most of the studies of gender differences regarding SEIs have focused on the gender of the instructor rather than the student. Positive characteristics of stereotypical men include rationality, competence, and assertiveness, whereas for women warmth and expressiveness were seen as the main positive traits (Del Boca & Ashmore, 1980). Sprague and Massoni (2005) argue that the burden on female instructors is more labor-intensive, since the interpersonal relationship with students cannot be carried over from one semester to the next. Table 1 summarizes the conflicting findings regarding the ratings of male and female instructors.
Table 1. Gender differences in student ratings
|Female instructors||Study|
|Rated higher than male instructors||Centra (2009)—attributed to reasons other than bias.|
| ||Feldman (1993)—rated higher by female students.|
|Rated lower than male instructors||Lackritz (2004)|
| ||Heckert, Latier, Ringwald, & Silvey (2006)|
| ||Tatro (1995)|
| ||Mohan (2011)|
|No gender difference found||Bauer & Baltes (2002)|
| ||Blackhart, Peruche, DeWall, & Joiner (2006)|
| ||Centra & Gaubatz (2000)|
| ||Reid (2010)|
| ||Hancock, Shannon, & Trentham (1993)|
| ||Kohn & Hatfield (2006)|
Among the instructors’ attributes that potentially influence the ratings are the instructors’ positions or ranks, how demanding they are perceived to be, as well as experience, training, communication skills, and age (Blackburn & Lawrence, 1986). Isley and Singh (2007) found that while higher expected grades result in more favorable student evaluations, this relationship is significantly different depending upon faculty rank. Adjunct faculty ratings are most affected by student grade expectations, followed by tenured faculty, and lastly by tenure track (TT) faculty. Mohan (2011) also reports that nontenure track (NTT) faculty get higher ratings than TT faculty, although the effect can be altered, she argues, by inflating grades. Peterson et al. (2008) did not find any difference in ratings received by full-time faculty versus ratings received by adjunct faculty. Feldman (2007) reports higher ratings for higher ranked faculty compared with those of lower ranked faculty.
Several researchers have documented an absence of relationship between class timing and student ratings of instruction (Aleamoni, 1981; Benton & Cashin, 2012; Feldman, 1978). However, Peterson et al. (2008) found better ratings for daytime classes than evening classes. They attribute the finding either to higher expectations among students who work during the day and take evening classes, or to these students resenting homework that adds to their several preoccupations. They also found no evidence of any difference between spring and fall semester ratings.
Some classes are taught in modern facilities with stadium seating, spacious rooms, ports for student laptops, and Internet connections, while others are still taught in fairly old, cramped rooms with students on chairs with a large arm on which to write. Anecdotal data suggest that there might be a relationship between the quality of classroom facilities and the ratings of instruction. No research has looked into this aspect.
There is some evidence in the literature indicating a relationship between class size and student ratings, with lower class sizes yielding higher ratings (Feldman, 1984, 2007; Isley & Singh, 2007; Liaw & Goh, 2003). For class sizes under 80, there is a relatively steep price to be paid for each additional student in terms of loss of ratings (Bedard & Kuhn, 2008). The difference in ratings per additional student is not so great in larger class sizes (80–150 students). On the other hand, some research finds U-shaped ratings with small and large class sizes yielding higher ratings than class sizes in between, due to a selection bias where teachers known to be good are assigned the really large classes (Marsh, Overall, & Kesler, 1979; Wood, Linsky, & Straus, 1974). In general, instructors believe smaller class sizes are easier to engage, and therefore result in higher ratings.
We collected data on all student evaluations filled out between 2005 and 2009 in the college of business at a large public university. About 6,000 sections of various courses were taught during this period at the undergraduate and graduate levels. Table 2 shows the number of sections taught in each year, segmented into four categories based on course type and course level—GN, GC, UN, and UC.
Table 2. Number of sections taught in the business school by year and by category
|Year||GN||GC||Grad Total||UN||UC||UG Total||Grand Total|
Data from four academic years, starting with 2005–2006 and ending with 2008–2009, were analyzed. Roughly 1,450 sections were offered every year, with about a third of them being graduate classes. PhD classes were eliminated from our analysis, since they tend to be very small in size, and sufficiently different from typical undergraduate or graduate courses. The average enrollment per section was 28.36, and the average number of responses to the SEIs per section was 18.20. The response rate for the SEIs overall across the 4-year span was roughly 64%, which is on par with most universities. Richardson (2005) surveyed the literature on student evaluation instruments, and indicates that response rates of around 60% are common and that a 70% response rate would be considered good. Table 3 shows the number of student responses to the SEIs by year and by category.
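The quoted response rate can be checked against the two per-section averages given above. The short sketch below only repeats that arithmetic; the variable names are illustrative and not taken from the study.

```python
# Per-section averages quoted in the text.
avg_enrollment = 28.36   # mean number of students enrolled per section
avg_responses = 18.20    # mean number of SEI responses per section

# With the same number of sections in both averages, the ratio of the
# means equals total responses divided by total enrollment.
response_rate = avg_responses / avg_enrollment
print(f"Implied response rate: {response_rate:.1%}")
```

The ratio comes out at about 64%, in line with the figure reported in the text.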
Table 3. Number of responses to the SEIs by year and by category
|Year||GN||GC||Grad Total||UN||UC||UG Total||Grand Total|
The SEI instrument used at this college is a modified version of one developed and originally validated at UC Berkeley. The modified version was validated at this college over 20 years ago by Brightman, Bhada, Elliott, and Vandenberg (1989). More recently, Nargundkar and Shrikhande (2012) found the instrument to still be valid. The instrument consists of 33 question items pertaining to various teaching-related factors, and question 34 addresses the OIE. In this study, we use the OIE ratings (based on a five-point Likert scale), along with information regarding the noninstructional factors. The noninstructional factors are listed in Table 4 along with the possible values for each of them.
Table 4. Noninstructional factors used in the study
|Semester||Fall, Spring, Summer|
|Time of day||Morning (starting before noon)|
| ||Afternoon (starting before 4:30 pm)|
| ||Early Evening (starting before 7:00 pm)|
|Course Type and level||Graduate noncore (GN)|
| ||Graduate core (GC)|
| ||Undergraduate noncore (UN)|
| ||Undergraduate core (UC)|
|Instructor gender||Female, Male|
|Instructor rank||Tenured|
| ||Nontenure track (NTT)|
| ||Part time instructor (PTI)|
| ||Graduate teaching assistant (GTA)|
| ||Tenure track (TT)|
|Location||Classroom south|
| ||General classroom building|
| ||Sparks Hall|
|Class size||Numeric variable with the number enrolled|
Dummy variables were created to indicate the various subgroups for time of day, location, rank, gender, course type, and course level. A regression analysis was then performed with the OIE score as the dependent variable and the dummies, together with class size, as the independent variables.
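As an illustration of the dummy coding described above, the following sketch turns one section record into a row of 0/1 indicators plus the numeric class size. It is a minimal sketch under stated assumptions, not the authors' procedure: the field names, the subset of factors shown, and the choice of omitted baseline level for each factor are all assumptions made for the example.

```python
# Levels for a subset of the categorical factors in Table 4.
LEVELS = {
    "semester": ["Fall", "Spring", "Summer"],
    "time_of_day": ["Morning", "Afternoon", "Early Evening", "Evening"],
    "gender": ["Female", "Male"],
}
# Omitted (baseline) level per factor; these choices are assumptions.
BASELINES = {"semester": "Fall", "time_of_day": "Evening", "gender": "Male"}


def dummy_code(section):
    """Return one regression row: 0/1 indicators plus numeric class size."""
    row = {}
    for factor, levels in LEVELS.items():
        for level in levels:
            if level == BASELINES[factor]:
                continue  # the baseline level is absorbed by the intercept
            row[f"{factor}={level}"] = 1 if section[factor] == level else 0
    row["class_size"] = section["class_size"]  # numeric, entered as-is
    return row


# Hypothetical section record, for illustration only.
example = {"semester": "Summer", "time_of_day": "Morning",
           "gender": "Female", "class_size": 24}
print(dummy_code(example))
```

Each factor with k levels contributes k - 1 indicator columns, so a coefficient on an indicator is read as a difference from the omitted level of that factor.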
The current norming process at our college involves using four segments initially proposed by Brightman et al. (1993): UC, UN, GC, and GN. The impact of various noninstructional factors was therefore analyzed individually within each of the four segments. Average scores for OIE for each noninstructional factor within all four segments were compared using two-sample t-tests and ANOVAs. The variances in the subgroups were not significantly different, making the use of t-tests and ANOVA appropriate. Where ANOVAs were significant, Tukey's pairwise comparisons helped to determine specific differences among subgroups.
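For readers less familiar with the equal-variance two-sample t-test mentioned above, a minimal sketch of the test statistic follows. The two samples are hypothetical section-level OIE means invented for the example; they are not data from the study, and the function name is illustrative.

```python
import math
import statistics


def pooled_t(sample_a, sample_b):
    """Equal-variance two-sample t statistic and its degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    var_a = statistics.variance(sample_a)  # sample variance (n - 1 divisor)
    var_b = statistics.variance(sample_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t_stat = (statistics.mean(sample_a) - statistics.mean(sample_b)) / std_err
    return t_stat, df


# Hypothetical OIE section means, for illustration only.
noncore = [4.4, 4.1, 4.6, 4.3, 4.5]
core = [4.2, 4.0, 4.3, 4.1, 4.4]
t_stat, df = pooled_t(noncore, core)
print(f"t = {t_stat:.3f} on {df} degrees of freedom")
```

The pooled variance is only appropriate when the subgroup variances are similar, which is the condition the text reports checking before using t-tests and ANOVA.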
In order to examine the impact of all the nonteaching factors taken together on the overall rating of instruction, a regression was performed on the entire data set. OIE score was used as the dependent variable; dummy variables were created for the categorical independent variables (semester, time of day, location, course level and course type, instructor rank, and instructor gender), and class size was entered as a numeric variable. Table 5 shows the final model with the significant variables.
Table 5. Regression of Q34 on noninstructional factors. Highlighting is to show groups of dummies for a given variable together
|Adjusted R square||.0390964|| || || |
|Standard error||.5276773|| || || |
|Observations||5,996|| || || |
| ||Coefficients||Standard Error||t-Stat||p-Value|
As seen above, overall ratings for summer and spring are significantly higher than for fall, with summer ratings being the highest. Similarly, time of day seems to matter, with each of the three times shown scoring lower than the evening classes, and afternoon classes scoring the lowest. Core classes in general score lower than noncore, with GC scoring the lowest. Differences in faculty rank were also significant, with NTT faculty scoring the highest and graduate teaching assistants the lowest.
Given the significance of all these factors in the presence of the others, we examine each noninstructional factor separately, as has been done by various researchers.
Course Type and Level
Table 6 shows the results of a two-sample t-test for the mean OIE scores (Likert scale, 1 = low, 5 = high) for core and noncore classes.
Table 6. OIE ratings by type (Core vs. NC) overall
|Course Type|| |
|Core||n = 2,490|
|Noncore||n = 3,334|
| ||p < .001|
Table 7 shows the results of a two-sample t-test for the mean OIE scores (Likert scale, 1 = low, 5 = high) for graduate and undergraduate classes.
Table 7. OIE ratings by Level (Grad vs. UG) overall
|Course Level|| |
|Graduate||n = 2,165|
|Undergraduate||n = 3,659|
| ||p < .01|
In both cases, there was a significant difference. Ratings for noncore classes were significantly higher than those for core classes, while graduate classes got higher ratings than undergraduate classes, consistent with expectations. Based on the above findings as well as the results of Brightman et al. (1993), four segments were created based on the combination of the course level and course type dimensions, rather than looking at each dimension independently. The results are shown in Table 8.
Table 8. OIE ratings by segment—course level and type combined
| ||Undergrad||Graduate|| |
|Core||4.228||4.260||p > .10|
| ||n = 1,668||n = 822|| |
|Noncore||4.301||4.349||p < .05|
| ||n = 1,991||n = 1,343|| |
| ||p < .001||p < .001|| |
Looking at the rows in the table, the ratings are not significantly different for UC and GC classes. Among noncore classes, however, ratings for graduate classes are significantly higher than for undergraduate classes. Looking at the columns in the table, ratings for noncore classes are higher than core classes in both the undergraduate and graduate segments. These findings are a little different from those in the regression analysis, which controls for all other factors.
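The section counts in Tables 6 and 7 are the row and column totals of Table 8, so the marginal means can be approximated by weighting the Table 8 cell means by their section counts. The sketch below does this; the derived means are an approximation computed from the rounded values in Table 8, not figures reported by the authors, and the function name is illustrative.

```python
# (mean OIE, number of sections) per segment, as reported in Table 8.
TABLE_8 = {
    ("Core", "Undergrad"): (4.228, 1668),
    ("Core", "Graduate"): (4.260, 822),
    ("Noncore", "Undergrad"): (4.301, 1991),
    ("Noncore", "Graduate"): (4.349, 1343),
}


def pooled_mean(axis, label):
    """Section-weighted mean over cells whose key matches `label`.

    axis 0 selects by course type, axis 1 by course level.
    """
    cells = [(m, n) for key, (m, n) in TABLE_8.items() if key[axis] == label]
    total_n = sum(n for _, n in cells)
    return sum(m * n for m, n in cells) / total_n, total_n


for axis, label in [(0, "Core"), (0, "Noncore"),
                    (1, "Undergrad"), (1, "Graduate")]:
    mean, n = pooled_mean(axis, label)
    print(f"{label:<10} n = {n:>5,}  weighted mean = {mean:.3f}")
```

The recovered section counts match Tables 6 and 7 (2,490 core, 3,334 noncore, 2,165 graduate, 3,659 undergraduate), and the weighted means preserve the ordering described in the text: noncore above core, and graduate above undergraduate.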
Instructor Gender and Rank
Table 9 summarizes our findings regarding instructor gender within each of the four segments.
Table 9. OIE ratings by instructor gender by segment
| ||Undergrad||Graduate|
|Core|| || |
|Female||4.237 (n = 929)||4.285 (n = 217)|
|Male||4.217 (n = 719)||4.243 (n = 572)|
| ||p > .10||p > .10|
|Noncore|| || |
|Female||4.355 (n = 688)||4.286 (n = 244)|
|Male||4.278 (n = 1,273)||4.365 (n = 1,086)|
| ||p < .01||p < .05|
For the core segment, no significant differences were found between male and female instructors. For the noncore segment, the ratings for female instructors were higher than for male instructors among undergraduate students, while the reverse was true among graduate students. There was no difference between the male and female instructor ratings when all four segments were combined.
Table 10 summarizes the results of OIE ratings by faculty rank.
Table 10. OIE ratings by faculty rank within each segment
| ||Undergrad|| ||Graduate|| |
|Core||1. Tenured||4.32 (n = 134)||1. NTT||4.36 (n = 332)|
| ||2. NTT||4.28 (n = 703)||2. Tenured||4.26 (n = 248)|
| ||3. GTA||4.25 (n = 322)||3. TT||4.14 (n = 55)|
| ||4. PTI||4.19 (n = 381)||4. PTI||4.04 (n = 144)|
| ||5. TT||4.15 (n = 27)|| || |
| ||1,2 > 3,4,5 and 3 > 5||p < .05||1 > 3,4 and 2 > 4||p < .05|
|Noncore||1. NTT||4.35 (n = 618)||1. NTT||4.41 (n = 362)|
| ||2. PTI||4.31 (n = 341)||2. Tenured||4.38 (n = 628)|
| ||3. TT||4.28 (n = 166)||3. PTI||4.20 (n = 150)|
| ||4. Tenured||4.25 (n = 547)||4. TT||4.13 (n = 144)|
| ||5. GTA||4.15 (n = 149)|| || |
| ||1 > 4,5 and 2 > 5||p < .05||1,2 > 3,4||p < .05|
In each of the four segments, the ANOVA was significant at p < .001 overall, meaning that the mean scores were not equal across all faculty status groups. Tukey's pairwise comparisons identified the specific differences, as shown in Table 10. For instance, for the UC segment, “1,2 > 3,4,5” means that the first two groups (Tenured and NTT) were not different from each other, but each of them was significantly better than groups 3, 4, and 5 (GTA, PTI, and TT). Furthermore, “3 > 5” means that group 3 (GTA) was significantly better than group 5 (TT).
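The compact notation used in Table 10 can be expanded mechanically into the individual significant pairs. The helper below is a hypothetical illustration of that reading, not part of the study; it assumes clauses are joined by " and " and that each clause has exactly one ">" sign.

```python
def expand_comparisons(summary):
    """Expand e.g. '1,2 > 3,4,5 and 3 > 5' into (higher, lower) group pairs."""
    pairs = []
    for clause in summary.split(" and "):
        higher, lower = clause.split(">")
        for h in higher.split(","):
            for l in lower.split(","):
                # int() tolerates the surrounding spaces left by the split
                pairs.append((int(h), int(l)))
    return pairs


# Rank order for the undergraduate core (UC) segment in Table 10.
UC_RANKS = {1: "Tenured", 2: "NTT", 3: "GTA", 4: "PTI", 5: "TT"}

for h, l in expand_comparisons("1,2 > 3,4,5 and 3 > 5"):
    print(f"{UC_RANKS[h]} rated significantly higher than {UC_RANKS[l]}")
```

Pairs that do not appear in the expansion, such as Tenured versus NTT, are the ones for which no significant difference was found.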
Semester, Time, and Class Size
Overall ratings in the regression were found to be significantly higher during summer compared to spring, and likewise significantly higher for spring compared to fall. Examining the impact of semester within the four segments, we found the following results (Table 11):
Table 11. OIE ratings by semester for each of the four segments
| ||Undergrad|| || ||Graduate|| || |
|Core||Summer||4.337||n = 345||Summer||4.326||n = 184|
| ||Spring||4.212||n = 671||Spring||4.244||n = 283|
| ||Fall||4.188||n = 652||Fall||4.240||n = 355|
| ||Summer > Spring, Fall; p < .05|| || ||p < .05|
|Noncore||Summer||4.397||n = 464|| || || |
| ||Spring||4.312||n = 795|| || || |
| ||Fall||4.229||n = 732|| || || |
| ||Summer > Spring > Fall; p < .05|| || ||Summer > Fall; p < .05|
Among UC classes, summer ratings were significantly higher than for spring and fall. There was, however, no significant difference in ratings for core graduate classes, perhaps due to the lower sample size in that category. Among UN classes, summer ratings were significantly higher than for spring, which were significantly higher than for fall. For GN classes, summer ratings were significantly higher than for fall, but ratings for spring were not significantly different from either fall or summer.
To test for differences in ratings for sections taught at various times during the day, the day was divided into four time segments. Classes that began before noon were in the “Morning” group; those that began at or after noon but before 4:30 pm were classified as “Afternoon”; those that began at or after 4:30 pm but before 7:15 pm were classified as “Early Evening,” while those that started at 7:15 pm or later were the “Evening” classes. The results are shown in Table 12.
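The four time segments defined above amount to a simple threshold rule on the class start time. A minimal sketch follows, using the cutoffs stated in this paragraph; the function name is illustrative.

```python
from datetime import time


def time_of_day(start):
    """Classify a class start time into the four segments described above."""
    if start < time(12, 0):
        return "Morning"
    if start < time(16, 30):
        return "Afternoon"
    if start < time(19, 15):
        return "Early Evening"
    return "Evening"


# Example start times, including the boundary cases.
for start in [time(9, 30), time(12, 0), time(16, 30), time(19, 15)]:
    print(start.strftime("%H:%M"), "->", time_of_day(start))
```

Each boundary time falls into the later segment, so a class starting exactly at noon counts as “Afternoon” and one starting exactly at 7:15 pm counts as “Evening.”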
Table 12. OIE ratings by time of day by segment
| ||Undergrad|| ||Graduate|| |
|Core||1. Afternoon||4.2260 (n = 338)||1. Morning||4.4117 (n = 184)|
| ||2. Morning||4.2229 (n = 675)||2. Afternoon||4.3332 (n = 31)|
| ||3. Early Evening||4.2123 (n = 300)||3. Evening||4.2305 (n = 291)|
| ||4. Evening||4.2229 (n = 355)||4. Early Evening||4.1844 (n = 303)|
| ||p > .10||p < .001; Pairwise: 1 > 3,4|
|Noncore||1. Morning||4.3479 (n = 340)||1. Evening||4.3947 (n = 656)|
| ||2. Early Evening||4.3019 (n = 569)||2. Morning||4.3413 (n = 85)|
| ||3. Evening||4.2908 (n = 339)||3. Afternoon||4.3160 (n = 53)|
| ||4. Afternoon||4.2239 (n = 630)||4. Early Evening||4.2992 (n = 549)|
| ||p < .05; Pairwise: 1,2 > 4||p < .05; Pairwise: 1 > 4|
The results are mixed. UC classes show no difference overall, whereas UN classes do better in the morning and early evening. GC classes score better in the mornings, while GN classes (which are mostly taught early evening or evening) score better in the evening compared to early evening. There was no difference in overall ratings among the four times of day when all four segments were combined.
Finally, a scatter plot of OIE ratings versus class size is shown in Figure 1.
It is difficult to discern a relationship between the two variables from the plot, given the high density of points. The only visible pattern seems to be a slightly downward trend among the very large class sizes (over 100).
The average class size was 28.36. We tested for differences in ratings between class sizes of 30 and below and class sizes over 30. Table 13 shows the results.
Table 13. OIE ratings and class size
| ||Class Size ≤ 30||Class Size > 30|
|Sample size (number of sections)||3,596||2,400|
| || ||p < .001|
The overall ratings for the smaller class sizes were significantly higher than for the larger ones.
Instructor ratings are significantly different for course-related factors like the course level and type. Ratings are higher for noncore classes compared to core classes. This is consistent with our expectations based on the literature. It seems to be fairly well established that initial liking for a course does in fact affect the ratings of an instructor. Graduate classes overall get better ratings than undergraduate classes. Graduate students are generally expected to be better prepared and have a greater liking for the subject than undergraduates. Among core classes, there is no difference in ratings for undergraduate and graduate classes. However, among noncore classes, there is a difference between the two.
Among core classes, there is no significant difference in ratings between male and female instructors. However, we see an interesting effect in the noncore classes. Undergraduate students rated female instructors higher than male instructors, whereas graduate students rated male instructors higher than female instructors. Younger students may prefer the nurturing characteristics attributed to female instructors. Similarly, the older graduate students perhaps prefer the perceived stereotypical qualities among male instructors of being forceful and goal driven.
Instructor rank or status also has an impact on overall ratings. In all four segments, NTT instructors consistently show higher ratings than untenured TT faculty. However, tenured faculty performed very well, especially in graduate classes. Among undergraduate classes, part time instructors (PTIs) have better ratings than untenured TT faculty. In our opinion, this finding is consistent with the incentive structure in place for faculty at research institutions. NTT faculty are primarily evaluated on teaching effectiveness, whereas TT faculty are evaluated primarily on research, with lower emphasis on teaching. However, when TT faculty do get tenure, the emphasis on research is reduced, giving them time to focus on teaching.
The influence of administrative factors like semester, time of day, and location (classroom quality) on overall ratings of instructors was mixed. Summer semester ratings are consistently higher than the ratings for spring or fall, with GC classes being the only exception. Summer classes on average have around 20–25 students, whereas fall and spring classes have 30+ students on average. The regression analysis shows the effect of the semester to be significant even after controlling for the class size effect. An explanation for better summer ratings may be that students take fewer classes during summer, allowing greater focus on those classes. Furthermore, frequent meetings during summer may build a better rapport with the instructor and better retention of material.
As for time of day, the regression shows a progression of rating differences, with instructors being rated the highest for evening classes, followed by morning, early evening, and afternoon classes, respectively. When the effect of timing was examined by itself for each of the four segments, we found some differences. Within the GC, morning classes receive a higher rating than evening, and not many classes are offered in the afternoon. Also, many of these morning courses are offered on Saturdays, when the graduate students are relatively free from work-related pressures. Within the UN, morning and early evening classes scored higher than afternoon classes, consistent with our expectation based on tiredness/sleepiness after lunch. Finally, in the GN, evening classes score higher than early evening (there are very few classes taught in the morning or afternoon). This is also consistent with our expectations. After a long day at work, the students are typically tired for the early evening class, but get a second wind post dinner for the evening classes. None of the classroom location variables was significant in the regression. In other words, location (and by proxy, classroom quality) did not affect OIE ratings.
The effect of class size on OIE ratings is consistent with recent literature. Smaller class sizes have significantly higher ratings than larger ones. We first tested class sizes of 30 and below against those over 30, since this cutoff is close to the overall average class size of a little over 28. To see if there was a hint of a U-shaped relationship as indicated by Wood et al. (1974), three groupings of class size (less than 20, 21–40, and 40+) were also tested. The results were unidirectional, with larger classes getting lower ratings on average.
As Brightman (2005) points out, in order to effectively use SEIs for assessment, the instrument must first be valid. The validity of the instrument used at the College of Business of this large public university was established by Brightman et al. (1989) and the instrument was revalidated in recent times by Nargundkar and Shrikhande (2012). Furthermore, the results of the SEIs should be appropriately normed for fair feedback to faculty. In other words, the impact of noninstructional factors on overall ratings of instruction must be controlled for in evaluating faculty. Noninstructional factors are by definition not relevant to one's teaching ability or effectiveness, and are beyond the instructor's control. However, these factors have the ability to bias an instructor's effectiveness ratings, as shown in this article. This has a major implication for administrators evaluating faculty.
Based on our findings, administrators should look at various noninstructional factors when assessing faculty performance through student evaluations. At our business school, the four segments currently used for norming (UC, UN, GC, GN) by administrators are appropriate, given the results of this study. However, this study suggests that they are insufficient, and that several additional factors, namely, semester, time of day, instructor gender and rank, and class size also need to be considered. Based on our regression model, an instructor with an average score of 4.37 who happens to hit upon an adverse combination of these factors can in the worst case end up with a score of 4.05, while an instructor who hits upon the best combination of these factors can end up with a score of 4.57. In other words, two instructors with identical teaching effectiveness could get overall student ratings that differ by as much as .52 on a scale of 1–5. Given that most SEI ratings vary between 3.0 and 5.0 (a range of 2.0), a difference of .52 due to extraneous factors can be drastic. This implies that an administrator's perception of an instructor's effectiveness has the potential to be distorted to a significant degree by noninstructional factors beyond the instructor's control.
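The size of this distortion relative to the usual spread of ratings can be worked out directly from the figures quoted above. The sketch below only repeats that arithmetic; the variable names are illustrative.

```python
# Scores quoted in the text for an instructor of identical effectiveness.
worst_case = 4.05   # adverse combination of noninstructional factors
best_case = 4.57    # favorable combination of noninstructional factors

# Most SEI ratings are said to fall between 3.0 and 5.0.
typical_range = 5.0 - 3.0

spread = best_case - worst_case
share_of_range = spread / typical_range
print(f"Spread: {spread:.2f} points, "
      f"{share_of_range:.0%} of the typical range")
```

On these figures, noninstructional factors alone could account for roughly a quarter of the range within which most ratings fall.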
For other colleges, the implication of our study is that norming is essential, and administrators at each college must identify the noninstructional factors most relevant to norming in their institutional setting. Such a study is worth doing at every college that uses SEIs to evaluate faculty. The noninstructional factors we identified as significantly impacting student ratings of instruction may be specific to our institution alone.
Recent research (Benton & Cashin, 2012) suggested that it is a misconception to attribute poor overall ratings to such noninstructional factors. Our results suggest that while noninstructional factors cannot entirely explain poor (or good) ratings, they do have the potential to bias the ratings sufficiently to matter in administrative decisions. Peterson et al. (2008) in their study of a single department within a business school suggest the possibility that instructors may try to game the system by using noninstructional factors to improve their ratings without necessarily improving teaching effectiveness. Appropriate norming procedures can eliminate this problem.
Although our study suggests ways to mitigate the distortions caused by noninstructional factors on teaching effectiveness ratings, student evaluations are by no means the only measure of teaching effectiveness and student learning. Many researchers provide ways of guarding against potential bias in SEIs (Baldwin & Blattner, 2003). Using alternative approaches such as portfolios, peer feedback sessions, and informal student surveys in addition to SEIs can further help to combat or circumvent these potential biases. Scriven (2011) suggests three models for teacher evaluation in increasing order of desirability. First, a self-assessment by faculty members; second, student evaluation of instructors reported to administrators (the method most commonly adopted); and third, an external examiner evaluating student achievement and thereby inferring the efficacy of the teacher.
Overall, the debate in the literature tends to either extol the virtues of SEIs or denigrate them as useless. Our research shows that SEIs can be useful instruments as long as they are validated, and the biases that affect them are accounted for in the evaluation process.