Volume 27, Issue 1
Article

Beyond kappa: A review of interrater agreement measures

Mousumi Banerjee

E-mail address: banerjee@kci.wayne.edu

Center for Healthcare Effectiveness Research, Wayne State University School of Medicine, Detroit, Michigan 48201, U.S.A.

Michelle Capozzoli

Department of Mathematics University of New Hampshire Durham, New Hampshire 03824 U.S.A.

Laura McSweeney

Department of Mathematics University of New Hampshire Durham, New Hampshire 03824 U.S.A.

Debajyoti Sinha

Department of Mathematics University of New Hampshire Durham, New Hampshire 03824 U.S.A.

First published: 18 December 2008
Citations: 455

Abstract

In 1960, Cohen introduced the kappa coefficient to measure chance‐corrected nominal scale agreement between two raters. Since then, numerous extensions and generalizations of this interrater agreement measure have been proposed in the literature. This paper reviews and critiques various approaches to the study of interrater agreement, for which the relevant data comprise either nominal or ordinal categorical ratings from multiple raters. It presents a comprehensive compilation of the main statistical approaches to this problem, descriptions and characterizations of the underlying models, and discussions of related statistical methodologies for estimation and confidence‐interval construction. The emphasis is on various practical scenarios and designs that underlie the development of these measures, and the interrelationships between them.

Abstract (translated from the French)

In 1960, Cohen proposed the kappa coefficient as a tool for measuring agreement between two raters who express their judgments on a nominal scale. Many generalizations of this agreement measure have been proposed since then. The authors take a critical look at a number of these works, which deal with the case where the response scale is either nominal or ordinal. The main statistical approaches are reviewed, the underlying models are described and characterized, and the problems related to point or interval estimation are addressed. The emphasis is on various practical scenarios and experimental designs that underlie the use of these measures, and on the relationships among them.
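
As an illustration of the chance-corrected agreement measure the abstract refers to, the following is a minimal sketch of Cohen's kappa for two raters, using the standard definition kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of agreement and p_e is the agreement expected by chance from the raters' marginal frequencies. The function name `cohen_kappa` and the example ratings are hypothetical and are not taken from the article.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters' nominal labels on the same items.

    Illustrative sketch only: the name and interface are assumptions,
    not something defined in the reviewed article.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(ratings_a)

    # Observed agreement: fraction of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement: sum over categories of the product of the
    # two raters' marginal proportions.
    count_a = Counter(ratings_a)
    count_b = Counter(ratings_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    if p_e == 1.0:
        # Both raters used a single identical category; kappa is undefined.
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical data: 50 items, 35 agreements (20 "yes"/"yes", 15 "no"/"no").
rater_a = ["yes"] * 20 + ["no"] * 15 + ["yes"] * 5 + ["no"] * 10
rater_b = ["yes"] * 20 + ["no"] * 15 + ["no"] * 5 + ["yes"] * 10
print(round(cohen_kappa(rater_a, rater_b), 3))
```

In this made-up example the raters agree on 70% of the items, but half of that agreement would be expected by chance alone given their marginals, so kappa comes out at about 0.4. The multi-rater and ordinal extensions surveyed in the review are not covered by this sketch.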

