Estimating the Consistency and Accuracy of Classifications Based on Test Scores
Abstract
This article presents a method for estimating the accuracy and consistency of classifications based on test scores. The scores can be produced by any scoring method, including a weighted composite. The estimates use data from a single form. The reliability of the score is used to estimate effective test length in terms of discrete items. The true‐score distribution is estimated by fitting a 4‐parameter beta model. The conditional distribution of scores on an alternate form, given the true score, is estimated from a binomial distribution based on the estimated effective test length. Agreement between classifications on alternate forms is estimated by assuming conditional independence, given the true score. Evaluation of the method showed estimates to be within 1 percentage point of the actual values in most cases. Estimates of decision accuracy and decision consistency statistics were only slightly affected by changes in specified minimum and maximum possible scores.
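The estimation steps summarized above can be sketched in code. The following is a simplified, stdlib-only Python illustration, not the article's implementation: it fits a two-parameter beta to the proportional true scores by the method of moments (the article fits a four-parameter beta), rounds the effective test length to the nearest integer, and handles a single cut score. All function names and the numerical-integration grid are our own assumptions.

```python
import math

def effective_test_length(mean, var, rel, x_min, x_max):
    """Effective test length: the number of discrete binomial 'items'
    implied by the score reliability (rounded to the nearest integer)."""
    n = ((mean - x_min) * (x_max - mean) - rel * var) / (var * (1.0 - rel))
    return max(1, round(n))

def _beta_pdf(p, a, b):
    # density of a Beta(a, b) variable at p, computed via log-gamma
    if p <= 0.0 or p >= 1.0:
        return 0.0
    log_d = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
             + (a - 1.0) * math.log(p) + (b - 1.0) * math.log(1.0 - p))
    return math.exp(log_d)

def _prob_at_or_above(n, p, cut_items):
    # P(X >= cut_items) for X ~ Binomial(n, p)
    return sum(math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)
               for k in range(cut_items, n + 1))

def classification_indices(mean, var, rel, x_min, x_max, cut, grid=400):
    """Approximate decision consistency and accuracy for one cut score."""
    n = effective_test_length(mean, var, rel, x_min, x_max)
    width = float(x_max - x_min)
    # method-of-moments beta fit to the proportional true scores;
    # true-score variance is reliability times observed variance
    t_mean = (mean - x_min) / width
    t_var = rel * var / width ** 2
    common = t_mean * (1.0 - t_mean) / t_var - 1.0
    a, b = t_mean * common, (1.0 - t_mean) * common
    cut_prop = (cut - x_min) / width
    cut_items = math.ceil(cut_prop * n)   # cut on the effective-length scale
    consistency = accuracy = norm = 0.0
    for i in range(grid):                 # midpoint rule over true scores
        p = (i + 0.5) / grid
        w = _beta_pdf(p, a, b) / grid
        pass_prob = _prob_at_or_above(n, p, cut_items)
        # two alternate forms agree if both pass or both fail,
        # assuming conditional independence given the true score
        consistency += w * (pass_prob ** 2 + (1.0 - pass_prob) ** 2)
        # the observed classification is accurate when it matches the true one
        accuracy += w * (pass_prob if p >= cut_prop else 1.0 - pass_prob)
        norm += w
    return consistency / norm, accuracy / norm
```

For example, a 0-40 score scale with mean 30, variance 25, reliability .85, and a cut at 25 yields an effective test length of about 74 items; `classification_indices` then returns the estimated proportions of consistent and accurate classifications.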