Measuring pairwise interobserver agreement when all subjects are judged by the same observers
Abstract
We consider an experiment in which each subject in a sample is rated on an L‐point scale by each member of a fixed group of observers. Weighted kappa coefficients are defined to measure the degree of agreement among all the observers, between two particular observers, or between one particular observer and the other observers. Attention is paid to the selection of one or more homogeneous subgroups of observers. A linearized Taylor series expansion is used to derive explicit formulas for computing large-sample standard errors. The procedures are illustrated with a study in which seven pathologists separately classified 118 histological slides into five categories.
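As a rough illustration of the pairwise case, the sketch below computes the standard Cohen-style weighted kappa for every pair of observers who rated the same subjects. It is an assumption-laden example, not the paper's own formulas: the function names `weighted_kappa` and `pairwise_kappas`, the coding of categories as integers 0 to L-1, and the default quadratic agreement weights are all choices made here for illustration. The group-level coefficients and the large-sample standard errors described in the abstract are not reproduced.

```python
from itertools import combinations


def weighted_kappa(a, b, n_categories, weight=None):
    """Cohen-style weighted kappa for two observers.

    a, b: equal-length lists of ratings coded 0 .. n_categories - 1.
    weight: agreement weight function w(i, j) in [0, 1] with w(i, i) = 1.
            Defaults to quadratic weights (an assumption of this sketch).
    """
    L = n_categories
    if L < 2:
        raise ValueError("need at least two categories")
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("ratings must be non-empty and of equal length")
    if weight is None:
        def weight(i, j):
            return 1.0 - ((i - j) / (L - 1)) ** 2

    n = len(a)
    # Cross-classification counts of the two observers' ratings.
    counts = [[0] * L for _ in range(L)]
    for i, j in zip(a, b):
        counts[i][j] += 1
    row = [sum(counts[i]) / n for i in range(L)]
    col = [sum(counts[i][j] for i in range(L)) / n for j in range(L)]

    # Weighted observed agreement and weighted chance-expected agreement.
    p_obs = sum(weight(i, j) * counts[i][j] / n
                for i in range(L) for j in range(L))
    p_exp = sum(weight(i, j) * row[i] * col[j]
                for i in range(L) for j in range(L))
    if p_exp == 1.0:
        raise ValueError("kappa is undefined when expected agreement is 1")
    return (p_obs - p_exp) / (1.0 - p_exp)


def pairwise_kappas(ratings, n_categories, weight=None):
    """Weighted kappa for every pair of observers.

    ratings: dict mapping observer name -> list of ratings, with all
             observers rating the same subjects in the same order.
    """
    return {
        (p, q): weighted_kappa(ratings[p], ratings[q], n_categories, weight)
        for p, q in combinations(sorted(ratings), 2)
    }


if __name__ == "__main__":
    # Made-up toy ratings on a five-point scale, for illustration only.
    toy = {
        "A": [0, 1, 2, 3, 4, 2],
        "B": [0, 1, 2, 3, 4, 2],
        "C": [1, 1, 2, 4, 4, 0],
    }
    for pair, kappa in pairwise_kappas(toy, n_categories=5).items():
        print(pair, round(kappa, 3))
```

A table of pairwise values like this is one informal way to look for homogeneous subgroups of observers: observers whose mutual kappas are all high are candidates for the same subgroup.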




