Haplotype analysis in the presence of informatively missing genotype data



Missing genotypes are common in practical genetic studies, but the underlying missing data mechanism is generally unknown to investigators. Although some statistical methods can handle missing data, including methods for haplotype frequency estimation and haplotype association analysis, they usually assume that genotypes are missing at random, that is, that at a given marker, different genotypes and different alleles are missing with the same probability. This simple assumption is unlikely to hold in practice, yet few studies to date have examined the magnitude of the effects when it is violated. In this study, we demonstrate that violating this assumption may lead to serious bias in haplotype frequency estimates, and that haplotype association analysis based on it can produce both false-positive and false-negative evidence of association. To address this limitation of current methods, we propose a general missing data model that characterizes missing data patterns across a set of two or more markers simultaneously. We prove that, under this model, haplotype frequencies and missing data probabilities are identifiable if and only if there is linkage disequilibrium between the markers. Simulation studies of haplotypes consisting of two single nucleotide polymorphisms illustrate that the proposed model can reduce the bias in both haplotype frequency estimates and association analysis caused by incorrect assumptions about the missing data mechanism. Finally, we illustrate the utility of our method through its application to a real data set. Genet. Epidemiol. 2006. © 2006 Wiley-Liss, Inc.
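The bias described above can be illustrated with a small numerical sketch. It is not the missing data model proposed in the paper; it only shows how a naive analysis behaves when missingness depends on genotype. The haplotype frequencies, the retention probabilities, and all function names below are hypothetical choices made for illustration. The sketch assumes two biallelic SNPs, random mating, missingness at the first marker only, and a complete-case analysis (individuals with a missing genotype are dropped) followed by a standard two-SNP EM algorithm applied to expected genotype proportions rather than to sampled data.

```python
from itertools import product

HAPLOTYPES = [(0, 0), (0, 1), (1, 0), (1, 1)]

def genotype_table(hap_freqs):
    """Expected two-SNP genotype probabilities under random mating.
    Keys are (g1, g2), the allele-1 counts (0, 1, or 2) at each marker."""
    table = {}
    for (h1, f1), (h2, f2) in product(hap_freqs.items(), repeat=2):
        g = (h1[0] + h2[0], h1[1] + h2[1])
        table[g] = table.get(g, 0.0) + f1 * f2
    return table

def complete_cases(table, keep_prob):
    """Genotype distribution among individuals whose marker-1 genotype is
    observed. keep_prob[g1] is the (assumed) probability that a marker-1
    genotype g1 is NOT missing."""
    kept = {g: p * keep_prob[g[0]] for g, p in table.items()}
    total = sum(kept.values())
    return {g: p / total for g, p in kept.items()}

def em_haplotype_freqs(table, n_iter=200):
    """Two-SNP EM for haplotype frequencies from genotype proportions.
    Only the double heterozygote (1, 1) is phase-ambiguous."""
    h = {hap: 0.25 for hap in HAPLOTYPES}
    for _ in range(n_iter):
        counts = {hap: 0.0 for hap in HAPLOTYPES}
        for (g1, g2), p in table.items():
            if g1 == 1 and g2 == 1:
                # E-step: split double heterozygotes between the two phases.
                cis = h[(0, 0)] * h[(1, 1)]
                trans = h[(0, 1)] * h[(1, 0)]
                w = cis / (cis + trans) if cis + trans > 0 else 0.5
                counts[(0, 0)] += p * w
                counts[(1, 1)] += p * w
                counts[(0, 1)] += p * (1 - w)
                counts[(1, 0)] += p * (1 - w)
            else:
                # Phase is determined by the genotype pair.
                counts[(int(g1 >= 1), int(g2 >= 1))] += p
                counts[(int(g1 == 2), int(g2 == 2))] += p
        # M-step: each individual carries two haplotypes.
        h = {hap: c / 2.0 for hap, c in counts.items()}
    return h

# Hypothetical true haplotype frequencies with strong linkage disequilibrium.
true_h = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
table = genotype_table(true_h)

# Same retention probability for every genotype: the abstract's "missing at random".
mar_est = em_haplotype_freqs(complete_cases(table, {0: 0.9, 1: 0.9, 2: 0.9}))

# Informative missingness: homozygotes for allele 1 at marker 1 are lost more often.
inf_est = em_haplotype_freqs(complete_cases(table, {0: 1.0, 1: 1.0, 2: 0.7}))

for hap in HAPLOTYPES:
    print(hap, true_h[hap], round(mar_est[hap], 4), round(inf_est[hap], 4))
```

Under equal retention probabilities the complete-case estimates recover the assumed frequencies, whereas genotype-dependent retention shifts them away from the truth, which is the kind of bias the general missing data model is meant to reduce.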