I. Thompson (The George Washington University) is Professor of Russian at The George Washington University, Washington, DC.
A Study of Interrater Reliability of the ACTFL Oral Proficiency Interview in Five European Languages: Data from ESL, French, German, Russian, and Spanish
Article first published online: 31 DEC 2008
© 1995 American Council on the Teaching of Foreign Languages
Foreign Language Annals
Volume 28, Issue 3, pages 407–422, October 1995
How to Cite
Thompson, I. (1995), A Study of Interrater Reliability of the ACTFL Oral Proficiency Interview in Five European Languages: Data from ESL, French, German, Russian, and Spanish. Foreign Language Annals, 28: 407–422. doi: 10.1111/j.1944-9720.1995.tb00808.x
- Issue published online: 31 DEC 2008
ABSTRACT The widespread use of the Oral Proficiency Interview (OPI) throughout the government, the academic community, and increasingly the business world calls for an extensive program of research on theoretical and practical issues associated with the assessment of speaking proficiency in general and the use of the OPI in particular. The present study, based on 795 double-rated oral proficiency interviews, was designed to consider the following questions: (1) What is the interrater reliability of ACTFL-certified testers in five European languages: ESL, French, German, Russian, and Spanish? (2) What is the relationship between interviewer-assigned ratings and second ratings based on audio replay of the interviews? (3) Does interrater reliability vary as a function of proficiency level? (4) Do different languages exhibit different patterns of interrater agreement across levels? (5) Are interrater disagreements confined mostly to the same main proficiency level? With regard to these questions, the results show: (1) Interrater reliability for all languages in this study was significant when measured with both Pearson's r and Cohen's modified kappa. (2) When second raters disagreed with interviewer-assigned ratings, they were three times as likely to assign lower scores as higher ones. (3) Some levels of performance are harder to rate than others. (4) The five languages exhibited different patterns of interrater agreement across levels. (5) Crossing of major borders was very frequent and depended on the proficiency level. In light of these findings, several practical steps are suggested to improve interrater reliability.
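For readers unfamiliar with the two statistics named in the abstract, the following is a minimal sketch of how agreement between an interviewer and a second rater might be computed. It is an illustration under stated assumptions, not the study's procedure: the ratings are invented, the numeric coding of proficiency levels is hypothetical, and the kappa shown is the standard unweighted Cohen's kappa, since the abstract does not specify how the "modified" kappa was defined. The helper names (`pearson_r`, `cohen_kappa`, `disagreement_direction`) are likewise made up for this example.

```python
from collections import Counter
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two rating lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def cohen_kappa(x, y):
    """Standard unweighted Cohen's kappa (NOT the study's modified kappa)."""
    n = len(x)
    # Proportion of interviews on which the two raters agree exactly.
    observed = sum(a == b for a, b in zip(x, y)) / n
    # Agreement expected by chance, from each rater's marginal frequencies.
    count_x = Counter(x)
    count_y = Counter(y)
    expected = sum(count_x[c] * count_y[c] for c in count_x) / (n * n)
    return (observed - expected) / (1 - expected)


def disagreement_direction(first, second):
    """Count how often the second rating is lower or higher than the first."""
    lower = sum(b < a for a, b in zip(first, second))
    higher = sum(b > a for a, b in zip(first, second))
    return lower, higher


# Hypothetical data: proficiency levels coded as integers (an assumption).
interviewer = [1, 2, 2, 3, 4, 4, 5, 6]
second_rater = [1, 2, 1, 3, 4, 3, 5, 6]

print("Pearson r:", round(pearson_r(interviewer, second_rater), 3))
print("Cohen's kappa:", round(cohen_kappa(interviewer, second_rater), 3))
print("(lower, higher):", disagreement_direction(interviewer, second_rater))
```

Pearson's r treats the ratings as points on a numeric scale and rewards raters who rank candidates similarly, while kappa counts only exact matches and corrects them for chance agreement. That difference is one reason a study of this kind might report both.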