Evaluating an Automatically Scorable, Open-Ended Response Type for Measuring Mathematical Reasoning in Computer-Adaptive Tests


  • This research was partially funded by the Graduate Record Examinations Board. The opinions expressed in this article are not necessarily those of the sponsor.

  • We gratefully acknowledge the key contributions of Ken Berger, Dave Bostain, Daryl Ezzo, Jutta Levin, Alex Vasilev, and the Mathematical Reasoning Team in developing the mathematical expression (ME) response type, writing ME items, and collecting and analyzing the response data.

  • Requests for reprints should be sent to the first author.

RANDY ELLIOT BENNETT is Principal Research Scientist, Educational Testing Service, Rosedale Road, Princeton, NJ 08541; rbennett@ets.org. Degrees: BA, SUNY at Stony Brook; MA, EdM, EdD, Teachers College, Columbia University. Specialization: new modes of assessment.

MANFRED STEFFEN is Principal Measurement Specialist, Educational Testing Service, Mail Stop 13-L, Princeton, NJ 08541; msteffen@ets.org. Degrees: BS, MA, Stetson University; PhD, University of Iowa. Specializations: educational measurement, adaptive testing.

MARK KEVIN SINGLEY is Director, Research, Educational Testing Service, Princeton, NJ 08541; ksingley@ets.org. Degrees: BA, Haverford College; PhD, Carnegie Mellon University. Specializations: cognitive science, human-computer interaction.

MARY MORLEY is Research Scientist, Educational Testing Service, Mail Stop 12-R, Rosedale Road, Princeton, NJ 08541. Degrees: BS, University of Maryland; MA, PhD, University of Chicago. Specializations: mathematics, testing mathematics.

DANIEL JACQUEMIN is Research Assistant, Educational Testing Service, Rosedale Road, Mail Stop 11-R, Princeton, NJ 08541; djacquemin@ets.org. Degree: BS, Rutgers, The State University of New Jersey. Specialization: mathematical scoring algorithms.


The first generation of computer-based tests depends largely on multiple-choice items and constructed-response questions that can be scored through literal matches with a key. This study evaluated scoring accuracy and item functioning for an open-ended response type in which correct answers, posed as mathematical expressions, can take many different surface forms. Items were administered to 1,864 participants in field trials of a new admissions test for quantitatively oriented graduate programs. Results showed that automatic scoring approximated the accuracy of multiple-choice scanning, with all processing errors stemming from examinees entering responses improperly. In addition, the items functioned similarly to other response types being considered for the measure in difficulty, item-total relations, and male-female performance differences.
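
To illustrate why literal key matching is insufficient for this response type, the following minimal Python sketch treats a response as correct when it agrees numerically with the key at randomly sampled points. It is an illustration under stated assumptions, not the scoring procedure used in the study: the function name `equivalent`, the sampling range, the tolerance, and the use of Python syntax for expressions are all hypothetical choices.

```python
import math
import random


def equivalent(response, key, variables=("x",), trials=20, tol=1e-9, seed=0):
    """Return True if two expression strings agree at random sample points.

    Hypothetical sketch: expressions use Python syntax (e.g. "2*(x+1)").
    eval() is used only for brevity and is NOT safe for untrusted input;
    a real scorer would need a dedicated expression parser.
    """
    try:
        response_code = compile(response, "<response>", "eval")
        key_code = compile(key, "<key>", "eval")
    except SyntaxError:
        # A malformed entry cannot be evaluated, so it is scored as incorrect.
        return False

    rng = random.Random(seed)  # fixed seed keeps the check reproducible
    allowed = {"__builtins__": {}, "sqrt": math.sqrt}
    compared = 0
    for _ in range(trials):
        # Draw one value per variable from an assumed range of [1, 10].
        point = {name: rng.uniform(1.0, 10.0) for name in variables}
        try:
            a = eval(response_code, allowed, dict(point))
            b = eval(key_code, allowed, dict(point))
        except (ZeroDivisionError, ValueError, OverflowError):
            continue  # skip points where either expression is undefined
        except (NameError, TypeError):
            return False  # unknown symbol or non-numeric construct
        if not math.isclose(a, b, rel_tol=tol, abs_tol=tol):
            return False
        compared += 1
    # Require at least one successful comparison before accepting.
    return compared > 0


if __name__ == "__main__":
    print(equivalent("2*(x+1)", "2*x+2"))
    print(equivalent("x+2", "2*x"))
```

Numerical sampling is only one way to compare surface forms. It can wrongly accept expressions that differ only at unsampled points, which is why the sketch uses several trials and a fixed seed.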