ABSTRACT

The widespread use of the Oral Proficiency Interview (OPI) throughout the government, the academic community, and, increasingly, the business world calls for an extensive program of research on theoretical and practical issues in the assessment of speaking proficiency in general and the use of the OPI in particular. The present study, based on 795 double-rated oral proficiency interviews, was designed to address the following questions: (1) What is the interrater reliability of ACTFL-certified testers in five European languages: ESL, French, German, Russian, and Spanish? (2) What is the relationship between interviewer-assigned ratings and second ratings based on audio replay of the interviews? (3) Does interrater reliability vary as a function of proficiency level? (4) Do different languages exhibit different patterns of interrater agreement across levels? (5) Are interrater disagreements confined mostly to the same main proficiency level? With regard to these questions, the results show: (1) Interrater reliability for all languages in the study was significant whether measured by Pearson's r or by Cohen's modified kappa. (2) When second raters disagreed with interviewer-assigned ratings, they were three times as likely to assign lower scores as higher ones. (3) Some levels of performance are harder to rate than others. (4) The five languages exhibited different patterns of interrater agreement across levels. (5) Crossing of major level borders was very frequent and depended on the proficiency level. On the basis of these findings, several practical steps are suggested to improve interrater reliability.
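To illustrate the two reliability statistics named above, the following sketch computes Pearson's r and a standard (unweighted) Cohen's kappa for two raters of the same interviews. The rating data are hypothetical, the level codes are arbitrary ordinal integers standing in for ACTFL levels, and plain Cohen's kappa is used here as a stand-in for the study's modified kappa, whose exact formulation is not given in the abstract.

```python
# Hedged sketch: interrater agreement statistics on invented data.
# Neither the data nor the specific kappa variant comes from the study itself.
from collections import Counter

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def cohens_kappa(x, y):
    """Unweighted Cohen's kappa: exact agreement corrected for chance."""
    n = len(x)
    observed = sum(a == b for a, b in zip(x, y)) / n          # proportion of exact matches
    cx, cy = Counter(x), Counter(y)
    expected = sum(cx[k] * cy[k] for k in cx) / (n * n)       # chance agreement from marginals
    return (observed - expected) / (1 - expected)

# Hypothetical interviewer ratings vs. second ratings from audio replay.
interviewer = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]
second      = [1, 2, 2, 3, 2, 3, 4, 3, 5, 5]

r = pearson_r(interviewer, second)        # ≈ 0.95
kappa = cohens_kappa(interviewer, second) # ≈ 0.74
```

Note that r rewards any monotone association while kappa credits only exact level matches, which is why studies of this kind often report both: high r with lower kappa signals raters who rank candidates similarly but disagree on level boundaries.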