Probabilistic linkage of large public health data files
Article first published online: 28 FEB 2007
Copyright © 1995 John Wiley & Sons, Ltd.
Statistics in Medicine
Volume 14, Issue 5-7, pages 491–498, 15 March - 15 April 1995
How to Cite
Jaro, M. A. (1995), Probabilistic linkage of large public health data files. Statist. Med., 14: 491–498. doi: 10.1002/sim.4780140510
- Issue published online: 28 FEB 2007
Probabilistic linkage technology makes it feasible and efficient to link large public health databases in a statistically justifiable manner. The problem addressed by the methodology is that of matching two files of individual data under conditions of uncertainty. Each field is subject to error, which is measured by the probability that the field agrees given that a record pair matches (called the m probability) and the probability of chance agreement of its value states (called the u probability). Fellegi and Sunter pioneered record linkage theory. Advances in methodology include use of an EM algorithm for parameter estimation, optimization of matches by means of a linear sum assignment program, and more recently, a probability model that addresses both m and u probabilities for all value states of a field. This provides a means for obtaining greater precision from non-uniformly distributed fields, without the theoretical complications arising from frequency-based matching alone. The model includes an iterative parameter estimation procedure that is more robust than pre-match estimation techniques. The methodology was originally developed and tested by the author at the U.S. Census Bureau for census undercount estimation. The more recent advances and a new generalized software system were tested and validated by linking highway crashes to Emergency Medical Service (EMS) reports and to hospital admission records for the National Highway Traffic Safety Administration (NHTSA).
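To illustrate how the m and u probabilities described above are combined, the following is a minimal sketch of Fellegi–Sunter-style scoring in Python. The field names and parameter values are hypothetical, chosen only for illustration; the paper's actual software estimates these parameters via the EM algorithm rather than fixing them by hand.

```python
import math

def field_weights(m, u):
    """Agreement and disagreement weights for one field.
    m: P(field agrees | record pair is a true match)
    u: P(field agrees by chance | record pair is a non-match)
    Weights are log base 2, as is conventional in the
    Fellegi-Sunter framework."""
    agree = math.log2(m / u)
    disagree = math.log2((1 - m) / (1 - u))
    return agree, disagree

def pair_score(params, agreements):
    """Total weight for a candidate record pair.
    params: list of (m, u) tuples, one per field
    agreements: parallel list of booleans (did the field agree?)
    Pairs scoring above an upper threshold are declared links;
    below a lower threshold, non-links; in between, clerical review."""
    score = 0.0
    for (m, u), agrees in zip(params, agreements):
        a, d = field_weights(m, u)
        score += a if agrees else d
    return score

# Hypothetical (m, u) values for three fields, e.g. surname,
# birth year, and ZIP code.
params = [(0.95, 0.01), (0.90, 0.05), (0.85, 0.10)]

# Agreement on all three fields yields a high score;
# disagreement on the first field pulls the score down sharply,
# since its m/u ratio is the most discriminating.
high = pair_score(params, [True, True, True])
low = pair_score(params, [False, True, True])
```

Fields whose values are non-uniformly distributed (the situation the value-state model above addresses) would, in a fuller implementation, carry different u probabilities per value state rather than a single u per field.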