Scoring Constructed Responses Using Expert Systems

Authors


HENRY I. BRAUN is Vice-President for Research Management, Educational Testing Service, Princeton, NJ 08541. Degrees: BS, McGill University; MS, PhD, Stanford University. Specializations: Bayesian analyses, stochastic modeling, development of statistical methodology, and demographics.

RANDY ELLIOT BENNETT is Senior Research Scientist, Educational Testing Service, Princeton, NJ 08541. Degrees: BA, SUNY at Stony Brook; MA, EdM, EdD, Columbia University. Specializations: constructed-response assessment and automated approaches, measurement in special education, and diagnostic assessment.

DOUGLAS FRYE is Research Scientist, Computer Science Department, Yale University, 51 Prospect Street, New Haven, CT 06520. Degrees: BA, New York University; PhD, Yale University. Specialization: cognitive development.

ELLIOT SOLOWAY is Associate Professor, University of Michigan, 1101 Beal Ave., Rm. 152 ATL, Ann Arbor, MI 48109-2110. Degrees: BA, Ohio State University; MS, PhD, University of Massachusetts. Specialization: computer science in education.

Abstract

The use of constructed-response items in large-scale standardized testing has been hampered by the costs and difficulties associated with obtaining reliable scores. The advent of expert systems may signal the eventual removal of this impediment. This study investigated the accuracy with which expert systems could score a new, non-multiple-choice item type. The item type presents a faulty solution to a computer programming problem and asks the student to correct it. This item type was administered to a sample of high school seniors enrolled in an Advanced Placement course in Computer Science who also took the Advanced Placement Computer Science (APCS) examination. Results indicated that the expert systems were able to produce scores for between 82% and 95% of the solutions encountered and to display high agreement with a human reader on the correctness of the solutions. Diagnoses of the specific errors produced by students were less accurate. Correlations with scores on the objective and free-response sections of the APCS examination were moderate. Implications for additional research and for testing practice are offered.
