‘Mental Model’ Comparison of Automated and Human Scoring


  • The authors would like to extend their gratitude to the editor and two anonymous reviewers for their insightful and constructive commentaries on earlier drafts of this paper. Any remaining deficiencies are, of course, solely the authors'. The authors would also like to thank Clark Chalifour and Dick DeVore for their assistance in conducting this study.

  • Correspondence concerning this article should be addressed to David M. Williamson, The Chauncey Group International, 664 Rosedale Road, Princeton, New Jersey 08540.

DAVID M. WILLIAMSON is Senior Measurement Statistician, The Chauncey Group International, 664 Rosedale Road, Princeton, NJ 08540; dwilliamson@chauncey.com. Degrees: BS, Southwest Missouri State University; MA, Fordham University. Specialization: psychometrics.

ISAAC I. BEJAR is Principal Research Scientist, Educational Testing Service, 664 Rosedale Road, Princeton, NJ 08540; ibejar@ets.org. Degrees: BA, Interamerican University; MA, PhD, University of Minnesota. Specializations: item generation, automated scoring.

ANNE S. HONE is Senior Associate, The Chauncey Group International, 664 Rosedale Road, Princeton, NJ 08540; ahone@chauncey.com. Degrees: BA, Amherst College; MA, Columbia University. Specializations: architecture, applied measurement.


‘Mental models’ used by automated scoring for the simulation divisions of the computerized Architect Registration Examination are contrasted with those used by experienced human graders. Candidate solutions (N = 3613) received both automated and human holistic scores. Quantitative analyses suggest high correspondence between automated and human scores, thereby suggesting that similar mental models are implemented. Solutions with discrepancies between automated and human scores were selected for qualitative analysis. The human graders were reconvened to review the human scores and to investigate the source of score discrepancies in light of rationales provided by the automated scoring process. After review, slightly more than half of the score discrepancies were reduced or eliminated. Six sources of discrepancy between original human scores and automated scores were identified: subjective criteria; objective criteria; tolerances/weighting; details; examinee task interpretation; and unjustified. The tendency of the human graders to be persuaded by automated score rationales varied with the nature of the original score discrepancy. We determine that, while the automated scores are based on a mental model consistent with that of expert graders, some important differences remain, both intentional and incidental, that distinguish human from automated scoring. We conclude that automated scoring has the potential to enhance the validity evidence of scores in addition to improving efficiency.