Volume 36, Issue 2

Measuring pairwise interobserver agreement when all subjects are judged by the same observers

H. J. A. Schouten

Institute of Biostatistics, Erasmus University Rotterdam.

First published: June 1982
Citations: 61

Abstract

An experiment is considered where each of a sample of subjects is rated on an L‐point scale by each of a fixed group of observers. Weighted kappa coefficients are defined to measure the degree of agreement among the observers, between two particular observers, or between a particular observer and the other observers. Attention is paid to the selection of one or more homogeneous subgroups of observers. A linearized Taylor series expansion is used to derive explicit formulas for the computation of large sample standard errors. The procedures are illustrated within the context of a study where seven pathologists separately classified 118 histological slides into five categories.
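As a concrete illustration of the pairwise building block described in the abstract, the Python sketch below computes a weighted kappa between two observers using quadratic disagreement weights and then averages it over all observer pairs. This is a minimal sketch, not Schouten's exact estimators: the function names (`weighted_kappa`, `mean_pairwise_kappa`), the choice of quadratic weights, and the simple unweighted mean over pairs are illustrative assumptions, and the Taylor-series standard errors and subgroup-selection procedure from the paper are not reproduced.

```python
import numpy as np
from itertools import combinations

def weighted_kappa(a, b, L, weights=None):
    """Weighted kappa between two raters' scores a, b on a 1..L scale.

    weights[i, j] is the disagreement weight for category pair (i, j);
    defaults to quadratic disagreement weights (assumption, not from the paper).
    """
    a = np.asarray(a) - 1          # shift to 0-based category indices
    b = np.asarray(b) - 1
    if weights is None:            # quadratic disagreement weights
        i, j = np.meshgrid(np.arange(L), np.arange(L), indexing="ij")
        weights = ((i - j) / (L - 1)) ** 2
    # Observed joint distribution of the two raters' categories
    p_obs = np.zeros((L, L))
    for x, y in zip(a, b):
        p_obs[x, y] += 1
    p_obs /= len(a)
    # Chance-expected joint distribution from the two marginal distributions
    p_exp = np.outer(p_obs.sum(axis=1), p_obs.sum(axis=0))
    d_obs = (weights * p_obs).sum()   # observed weighted disagreement
    d_exp = (weights * p_exp).sum()   # chance-expected weighted disagreement
    # Kappa is undefined if the chance-expected disagreement is zero
    return 1.0 - d_obs / d_exp

def mean_pairwise_kappa(ratings, L):
    """Simple average of weighted kappa over all observer pairs.

    ratings: (n_subjects, n_observers) array of scores on a 1..L scale.
    """
    pairs = combinations(range(ratings.shape[1]), 2)
    return np.mean([weighted_kappa(ratings[:, r], ratings[:, s], L)
                    for r, s in pairs])

# Toy example mirroring the paper's setting: 118 subjects, 7 observers,
# 5 categories -- but with random ratings, so kappa will be near zero.
rng = np.random.default_rng(0)
ratings = rng.integers(1, 6, size=(118, 7))
print(mean_pairwise_kappa(ratings, L=5))
```

With disagreement weights, kappa equals one minus the ratio of observed to chance-expected weighted disagreement, so 1 indicates perfect agreement and 0 indicates agreement no better than chance expectation from the marginals.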

Number of times cited according to CrossRef: 61

  • Identification of areas of grading difficulties in prostate cancer and comparison with artificial intelligence assisted grading, Virchows Archiv, 10.1007/s00428-020-02858-w, (2020).
  • Indices, Introduction to Interrater Agreement for Nominal Data, 10.1007/978-3-030-11671-2, (81-145), (2019).
  • Episodic viral wheeze and multiple‐trigger wheeze in preschool children are neither distinct nor constant patterns. A prospective multicenter cohort study in secondary care, Pediatric Pulmonology, 10.1002/ppul.24411, 54, 9, (1439-1446), (2019).
  • A paired kappa to compare binary ratings across two medical tests, Statistics in Medicine, 10.1002/sim.8200, 38, 17, (3272-3287), (2019).
  • Development and measurement validity of an instrument for the impact of technology-mediated learning on learning processes, Computers & Education, 10.1016/j.compedu.2018.03.006, 121, (131-142), (2018).
  • Utility of Pathology Imagebase for standardisation of prostate cancer grading, Histopathology, 10.1111/his.13471, 73, 1, (8-18), (2018).
  • Ordinal-Level Variables, II, The Measurement of Association, 10.1007/978-3-319-98926-6, (297-370), (2018).
  • Rating experiments in forestry: How much agreement is there in tree marking?, PLOS ONE, 10.1371/journal.pone.0194747, 13, 3, (e0194747), (2018).
  • Asymptotic variability of (multilevel) multirater kappa coefficients, Statistical Methods in Medical Research, 10.1177/0962280218794733, (096228021879473), (2018).
  • Multiple‐rater kappas for binary data: Models and interpretation, Biometrical Journal, 10.1002/bimj.201600267, 60, 2, (381-394), (2017).
  • Heterogeneous clinical features in patients with pulmonary fibrosis showing histology of pleuroparenchymal fibroelastosis, Respiratory Investigation, 10.1016/j.resinv.2015.11.002, 54, 3, (162-169), (2016).
  • Interrater Reliability for Multiple Raters in Clinical Trials of Ordinal Scale, Drug Information Journal, 10.1177/009286150704100506, 41, 5, (595-605), (2016).
  • Assessing the reliability of ordered categorical scales using kappa-type statistics, Statistical Methods in Medical Research, 10.1191/0962280205sm413oa, 14, 5, (493-514), (2016).
  • Statistical description of interrater variability in ordinal ratings, Statistical Methods in Medical Research, 10.1177/096228020000900505, 9, 5, (475-496), (2016).
  • Modelling patterns of agreement and disagreement, Statistical Methods in Medical Research, 10.1177/096228029200100205, 1, 2, (201-218), (2016).
  • Visual assessment of Ki67 using a 5-grade scale (Eye-5) is easy and practical to classify breast cancer subtypes with high reproducibility, Journal of Clinical Pathology, 10.1136/jclinpath-2014-202695, 68, 5, (356-361), (2015).
  • Beyond 2000, A Chronicle of Permutation Statistical Methods, 10.1007/978-3-319-02744-9, (363-428), (2014).
  • Log‐linear modelling of pairwise interobserver agreement on a categorical scale, Statistics in Medicine, 10.1002/sim.4780110109, 11, 1, (101-114), (2013).
  • The ICI classification for calcaneal injuries: A validation study, Injury, 10.1016/j.injury.2011.09.004, 43, 6, (784-787), (2012).
  • Gastric HER2 Testing Study (GaTHER), The American Journal of Surgical Pathology, 10.1097/PAS.0b013e318244adbb, 36, 4, (577-582), (2012).
  • Hierarchical modeling of agreement, Statistics in Medicine, 10.1002/sim.5424, 31, 28, (3667-3680), (2012).
  • Generalized Agreement Statistics over Fixed Group of Experts, Machine Learning and Knowledge Discovery in Databases, 10.1007/978-3-642-23808-6_13, (191-206), (2011).
  • 2009 IEEE International Symposium on Biomedical Imaging: From Nano to Macro, 10.1109/ISBI.2009.5193271, (1190-1193), (2009).
  • Routine Versus Selective Multidetector-Row Computed Tomography (MDCT) in Blunt Trauma Patients: Level of Agreement on the Influence of Additional Findings on Management, The Journal of Trauma: Injury, Infection, and Critical Care, 10.1097/TA.0b013e318189371d, 67, 5, (1080-1086), (2009).
  • Agreement between Two Independent Groups of Raters, Psychometrika, 10.1007/s11336-009-9116-1, 74, 3, (477-491), (2009).
  • Agreement between an isolated rater and a group of raters, Statistica Neerlandica, 10.1111/j.1467-9574.2008.00412.x, 63, 1, (82-100), (2009).
  • A written test as an alternative to performance testing, Medical Education, 10.1111/j.1365-2923.1989.tb00819.x, 23, 1, (97-107), (2009).
  • Primary Ciliary Dyskinesia: Ciliary Activity, Acta Oto-Laryngologica, 10.3109/00016488609108677, 102, 3-4, (274-281), (2009).
  • Weighted Kappa for Multiple Raters, Perceptual and Motor Skills, 10.2466/pms.107.3.837-848, 107, 3, (837-848), (2008).
  • Resampling Probability Values for Weighted Kappa with Multiple Raters, Psychological Reports, 10.2466/pr0.102.2.606-613, 102, 2, (606-613), (2008).
  • Recognition of veneer restorations by dentists and beautician students, Journal of Oral Rehabilitation, 10.1111/j.1365-2842.1997.tb00365.x, 24, 7, (506-511), (2008).
  • Assessment of general movements: towards a better understanding of a sensitive method to evaluate brain function in young infants, Developmental Medicine & Child Neurology, 10.1111/j.1469-8749.1997.tb07390.x, 39, 2, (88-98), (2008).
  • Agreement on a Two-Point Scale, Statistica Neerlandica, 10.1111/j.1467-9574.1986.tb01160.x, 40, 1, (13-20), (2008).
  • Biostatistics in Medicine, Statistica Neerlandica, 10.1111/j.1467-9574.1985.tb01138.x, 39, 2, (191-201), (2008).
  • A new approach to inter-rater agreement through stochastic orderings: the discrete case, Metrika, 10.1007/s00184-007-0137-4, 67, 3, (349-370), (2007).
  • Kappa, Wiley Encyclopedia of Clinical Trials, 10.1002/9780471462422, (1-7), (2007).
  • Interobserver reliability of visual interpretation of electroencephalograms in children with newly diagnosed seizures, Developmental Medicine & Child Neurology, 10.1017/S0012162206000806, 48, 5, (374-377), (2007).
  • Sudden unexpected death in infancy: epidemiologically determined risk factors related to pathological classification, Acta Paediatrica, 10.1111/j.1651-2227.1998.tb00952.x, 87, 12, (1279-1287), (2007).
  • Diagnosis of IgE‐mediated allergy in the upper respiratory tract, Allergy, 10.1111/j.1398-9995.1991.tb00551.x, 46, 2, (99-104), (2007).
  • Usefulness of an Aura for Classification of a First Generalized Seizure, Epilepsia, 10.1111/j.1528-1157.1990.tb06102.x, 31, 5, (529-535), (2007).
  • Reliability Study on the Japanese Version of the Clinician’s Interview-Based Impression of Change, Dementia and Geriatric Cognitive Disorders, 10.1159/000097596, 23, 2, (104-115), (2006).
  • Reliability Study on the Japanese Version of the Clinician’s Interview-Based Impression of Change, Dementia and Geriatric Cognitive Disorders, 10.1159/000090296, 21, 2, (97-103), (2006).
  • Measuring interrater reliability among multiple raters: An example of methods for nominal data, Statistics in Medicine, 10.1002/sim.4780090917, 9, 9, (1103-1115), (2006).
  • 873 questions of Dutch dental patients: a challenge to dental health education, Community Dentistry and Oral Epidemiology, 10.1111/j.1600-0528.1984.tb01461.x, 12, 5, (308-314), (2006).
  • Reliability of assessment of adherence to an antimicrobial treatment guideline, Journal of Hospital Infection, 10.1016/j.jhin.2004.11.022, 60, 4, (321-328), (2005).
  • Quantitative Assessment of Regional Myocardial Function with MR-Tagging in a Multi-Center Study: Interobserver and Intraobserver Agreement of Fast Strain Analysis with Harmonic Phase (HARP) MRI, Journal of Cardiovascular Magnetic Resonance, 10.1080/10976640500295417, 7, 5, (783-791), (2005).
  • Kappa, Encyclopedia of Biostatistics, 10.1002/0470011815, (2005).
  • Interrater agreement in the assessment of motor manifestations of Huntington's disease, Movement Disorders, 10.1002/mds.20332, 20, 3, (293-297), (2004).
  • Nominal Scale Agreement, Encyclopedia of Statistical Sciences, 10.1002/0471667196, (2004).
  • Reproducibility of clinical data II: categorical outcomes, Pharmaceutical Statistics, 10.1002/pst.105, 3, 2, (109-122), (2004).
  • A general goodness‐of‐fit approach for inference procedures concerning the kappa statistic, Statistics in Medicine, 10.1002/sim.911, 20, 16, (2479-2488), (2001).
  • Can electronic medical images replace hard‐copy film? Defining and testing the equivalence of diagnostic tests, Statistics in Medicine, 10.1002/sim.929, 20, 19, (2845-2863), (2001).
  • Reproducibility of Measures of Overuse of Cataract Surgery by Three Physician Panels, Medical Care, 10.1097/00005650-199909000-00009, 37, 9, (937-945), (1999).
  • Overeenstemming bij Medische Beoordelingen [Agreement in Medical Assessments], Klinische statistiek [Clinical Statistics], 10.1007/978-90-313-9661-0, (33-38), (1999).
  • Grading medial collateral ligament injury: comparison of MR imaging and instrumented valgus-varus laxity test-device. A prospective double-blind patient study, European Journal of Radiology, 10.1016/0720-048X(95)00660-I, 21, 1, (18-24), (1995).
  • Inter- and intra-observer agreement in the assessment of the quality of spontaneous movements in the newborn, Brain and Development, 10.1016/S0387-7604(12)80145-8, 14, 5, (289-293), (1992).
  • Interobserver agreement in assessment of vestibulo-ocular responses., Journal of Neurology, Neurosurgery & Psychiatry, 10.1136/jnnp.50.8.1045, 50, 8, (1045-1047), (1987).
  • Nominal scale agreement among observers, Psychometrika, 10.1007/BF02294066, 51, 3, (453-466), (1986).
  • Problems in grading of prostatic carcinoma: interobserver reproducibility of five different grading systems, World Journal of Urology, 10.1007/BF00327011, 4, 3, (147-152), (1986).
  • Agreement between physicians on assessment of outcome following severe head injury, Journal of Neurosurgery, 10.3171/jns.1983.58.3.0321, 58, 3, (321-325), (1983).
