Volume 32, Issue 2

Estimating the Consistency and Accuracy of Classifications Based on Test Scores

Samuel A. Livingston

Corresponding Author

Educational Testing Service

SAMUEL A. LIVINGSTON is Senior Measurement Statistician, Educational Testing Service, Princeton, NJ 08541. Degrees: AB, University of Chicago; EdM, University of Pittsburgh; PhD, AM, Johns Hopkins University. Specializations: educational measurement and statistics.

Charles Lewis

Educational Testing Service

CHARLES LEWIS is Principal Research Scientist, Educational Testing Service, Princeton, NJ 08541. Degrees: BA, Swarthmore College; PhD, Princeton University. Specializations: Bayesian statistics and psychometrics.
First published: June 1995
Citations: 67

Abstract

This article presents a method for estimating the accuracy and consistency of classifications based on test scores. The scores can be produced by any scoring method, including a weighted composite. The estimates use data from a single form. The reliability of the score is used to estimate effective test length in terms of discrete items. The true‐score distribution is estimated by fitting a 4‐parameter beta model. The conditional distribution of scores on an alternate form, given the true score, is estimated from a binomial distribution based on the estimated effective test length. Agreement between classifications on alternate forms is estimated by assuming conditional independence, given the true score. Evaluation of the method showed estimates to be within 1 percentage point of the actual values in most cases. Estimates of decision accuracy and decision consistency statistics were only slightly affected by changes in specified minimum and maximum possible scores.
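The steps described in the abstract can be illustrated with a simplified computation. The sketch below is not the authors' implementation: it fits a 2-parameter beta true-score model by the method of moments in place of the 4-parameter beta described in the article, handles a single cut score, and works on scores expressed as proportions of the maximum; the function names and inputs are hypothetical.

```python
import math

def beta_pdf(t, a, b):
    """Density of a Beta(a, b) distribution at t in (0, 1)."""
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp((a - 1) * math.log(t) + (b - 1) * math.log(1 - t) - log_norm)

def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, x) * p**x * (1 - p)**(n - x) for x in range(k, n + 1))

def livingston_lewis_sketch(mean, var, reliability, cut, grid=400):
    """Estimate decision consistency and accuracy for a single cut score.

    `mean` and `var` are the mean and variance of scores expressed as
    proportions of the maximum possible score; `cut` is the passing
    proportion; `reliability` is the reliability of the scores.
    """
    # Step 1: effective test length implied by the reliability under a
    # binomial error model: error variance = E[t(1 - t)] / n.
    n = round((mean * (1 - mean) - reliability * var) / (var * (1 - reliability)))
    k = math.ceil(cut * n)  # minimum passing count on an n-item test

    # Step 2: method-of-moments beta fit to the true-score distribution,
    # whose variance is reliability * observed variance.
    true_var = reliability * var
    s = mean * (1 - mean) / true_var - 1
    a, b = mean * s, (1 - mean) * s

    # Step 3: integrate over the true-score distribution.  Scores on
    # alternate forms are treated as conditionally independent given the
    # true score, so the chance both forms agree at true score t is
    # p_pass(t)^2 + p_fail(t)^2.
    consistency = accuracy = 0.0
    for i in range(grid):
        t = (i + 0.5) / grid
        w = beta_pdf(t, a, b) / grid
        p_pass = binom_sf(k, n, t)          # P(observed pass | true score t)
        consistency += w * (p_pass**2 + (1 - p_pass)**2)
        accuracy += w * (p_pass if t >= cut else 1 - p_pass)
    return n, consistency, accuracy
```

For example, scores with mean 0.70 and variance 0.02 on the proportion scale, reliability 0.88, and a passing proportion of 0.65 imply an effective test length of about 80 discrete items, from which the consistency and accuracy integrals follow. Raising the reliability raises the effective test length and hence both agreement indices.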

