Efficient Genomewide Selection of PCA-Correlated tSNPs for Genotype Imputation

Authors

  • Asif Javed,

    Corresponding author
    1. Computational Biology Center, IBM T. J. Watson Research, Yorktown Heights, NY 10598, USA
    2. Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY 12180, USA
      This work was completed while the author was a Ph.D. student at Rensselaer Polytechnic Institute.
  • Petros Drineas,

    1. Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY 12180, USA
  • Michael W. Mahoney,

    1. Department of Mathematics, Stanford University, Palo Alto, CA 94305, USA
  • Peristera Paschou

    Corresponding author
    1. Department of Molecular Biology and Genetics, Democritus University of Thrace, Alexandroupoli 68100, Greece
      Peristera Paschou, Department of Molecular Biology and Genetics, Democritus University of Thrace, Alexandroupoli 68100, Greece. Tel: +302551030658; Fax: +302551030613; E-mail: ppaschou@mbg.duth.gr

Summary

The linkage disequilibrium structure of the human genome allows the identification of small sets of tagging single nucleotide polymorphisms (SNPs), or tSNPs, that efficiently represent dense sets of markers. This structure can be expressed in linear algebraic terms, as evidenced by well-documented principal components analysis (PCA)-based methods. Here we apply, for the first time, a PCA-based methodology to efficient genomewide tSNP selection and explore the linear algebraic structure of the human genome. Our algorithm divides the genome into contiguous, nonoverlapping windows of high linear structure. Coupling this novel window definition with a PCA-based tSNP selection method, we analyze 2.5 million SNPs from the HapMap phase 2 dataset. We show that 10–25% of these SNPs suffice to predict the remaining genotypes with over 95% accuracy. A comparison with other popular methods in the ENCODE regions indicates significant genotyping savings. We also evaluate the portability of genomewide tSNPs across a diverse set of populations (HapMap phase 3 dataset). Interestingly, African populations are good reference populations for the rest of the world. Finally, we demonstrate the applicability of our approach in a real genomewide disease association study. The chosen tSNP panels can be used for genotype imputation with either a simple regression-based algorithm or more sophisticated genotype imputation methods.
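To make the idea of PCA-based tSNP selection followed by regression-based imputation concrete, the following is a minimal sketch for a single window. It is not the authors' algorithm: the greedy column selection on the top-k right singular vectors is a simple stand-in, and the function names (`select_tsnps`, `fit_imputer`, `impute`), the window rank `k`, and the toy genotype matrix are all assumptions made for illustration. Genotypes are assumed to be coded as 0, 1, or 2 copies of an allele, with individuals as rows and SNPs as columns.

```python
import numpy as np

def select_tsnps(G, k):
    """Greedily pick k SNP columns that span the top-k principal directions.

    Illustrative stand-in for PCA-correlated selection, not the paper's method.
    """
    Gc = G - G.mean(axis=0)                      # center each SNP
    _, _, Vt = np.linalg.svd(Gc, full_matrices=False)
    R = Vt[:k].copy()                            # k x n_snps; column j describes SNP j
    chosen = []
    for _ in range(k):
        norms = (R ** 2).sum(axis=0)             # what each SNP still explains
        j = int(np.argmax(norms))
        chosen.append(j)
        q = R[:, j] / np.sqrt(norms[j])
        R -= np.outer(q, q @ R)                  # remove the direction SNP j covers
    return sorted(chosen)

def fit_imputer(G_ref, tsnps):
    """Least-squares map from tSNP genotypes (plus intercept) to every SNP."""
    X = np.column_stack([np.ones(G_ref.shape[0]), G_ref[:, tsnps]])
    W, *_ = np.linalg.lstsq(X, G_ref, rcond=None)
    return W

def impute(G_typed, W):
    """Predict all SNPs from typed tSNPs and round to valid genotypes 0/1/2."""
    X = np.column_stack([np.ones(G_typed.shape[0]), G_typed])
    return np.clip(np.rint(X @ W), 0, 2).astype(int)

# Toy window (hypothetical data): 12 SNPs in 3 groups of perfectly linked markers,
# with a few columns coded on the opposite allele.
groups = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
flipped = [1, 6, 11]

def make_genotypes(base):
    G = base[:, groups].copy()
    G[:, flipped] = 2 - G[:, flipped]
    return G

base_ref = np.array([[i % 3, (i // 3) % 3, (i // 9) % 3] for i in range(27)])
base_new = np.array([[0, 2, 1], [2, 2, 0], [1, 0, 1], [0, 0, 2]])
G_ref, G_new = make_genotypes(base_ref), make_genotypes(base_new)

tsnps = select_tsnps(G_ref, k=3)                 # chosen on the reference panel
W = fit_imputer(G_ref, tsnps)
pred = impute(G_new[:, tsnps], W)                # only tSNPs are "typed" in new samples
untyped = [j for j in range(G_ref.shape[1]) if j not in tsnps]
accuracy = float(np.mean(pred[:, untyped] == G_new[:, untyped]))
print("tSNPs:", tsnps, "imputation accuracy on untyped SNPs:", accuracy)
```

In this toy window the linkage is perfect, so one tSNP per group is enough; real windows have only approximate linear structure, and both the number of tSNPs and the achievable accuracy would depend on the data.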
