Single nucleotide differences (SNDs) in the dbSNP database may lead to errors in genotyping and haplotyping studies

Authors

  • Lucia Musumeci,

    1. Plunkett Chair of Molecular Biology (Medicine), Bosch Institute, The University of Sydney, Medical Foundation Building (K25), Camperdown, NSW 2006, Australia
    Current affiliation:
    1. Immunology and Infectious Diseases Unit, GIGA-R, Liège University, Liège, Belgium
    • The first two authors contributed equally to this article.

  • Jonathan W. Arthur,

    1. Discipline of Medicine, Sydney Medical School, The University of Sydney, Camperdown, NSW 2006, Australia
    2. Sydney Bioinformatics, The University of Sydney, Camperdown, NSW 2006, Australia
    • The first two authors contributed equally to this article.

  • Florence S. G. Cheung,

    1. Plunkett Chair of Molecular Biology (Medicine), Bosch Institute, The University of Sydney, Medical Foundation Building (K25), Camperdown, NSW 2006, Australia
  • Ashraful Hoque,

    1. The University of Texas M.D. Anderson Cancer Center, Houston, Texas
  • Scott Lippman,

    1. The University of Texas M.D. Anderson Cancer Center, Houston, Texas
  • Juergen K. V. Reichardt

    Corresponding author
    1. Plunkett Chair of Molecular Biology (Medicine), Bosch Institute, The University of Sydney, Medical Foundation Building (K25), Camperdown, NSW 2006, Australia
    • University of Sydney, Plunkett Chair of Molecular Biology (Medicine), Medical Foundation Building (K25), 92–94 Parramatta Road, Camperdown, New South Wales, 2006 Australia

  • Communicated by Ian N.M. Day

Abstract

The creation of single nucleotide polymorphism (SNP) databases, such as NCBI dbSNP, has facilitated scientific research in many fields. SNP discovery and detection have improved to the extent that over 17 million human reference (rs) SNPs have been reported to date (Build 129 of dbSNP). Unfortunately, SNP databases are not always complete or accurate. In fact, half of the reported SNPs are still only candidate SNPs that have not been validated in a population. We describe the identification of single nucleotide differences (SNDs) in humans that may contaminate the dbSNP database. These SNDs are reported as real SNPs in the database but do not exist as such; they are merely artifacts caused by the presence of a paralogous (highly similar, duplicated) sequence in the genome. Using sequencing, we showed how SNDs could originate in two paralogous genes and evaluated samples from a population of 100 individuals for the presence or absence of SNPs. Moreover, using bioinformatics, we predicted as many as 8.32% of the biallelic, coding SNPs in the dbSNP database to be SNDs. Our identification of SNDs in the database will not only allow researchers to select truly informative SNPs for association studies but also aid in determining accurate SNP genotypes and haplotypes. Hum Mutat 31:67–73, 2010. © 2009 Wiley-Liss, Inc.
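
To make the SND concept concrete, the following minimal Python sketch illustrates how a reported SNP might be flagged as a candidate SND when its two alleles coincide with the bases found at the same aligned position in a gene and its paralogue. This is an illustration only and is not the bioinformatics pipeline used in the study; the function names, the toy sequences, and the assumption that the two sequences are already aligned and of equal length are all hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def paralogue_differences(seq_a: str, seq_b: str) -> Dict[int, Tuple[str, str]]:
    """Return {0-based position: (base in seq_a, base in seq_b)} for every
    mismatch between two pre-aligned, equal-length sequences.
    Assumption: alignment has been done elsewhere; gaps/ambiguous bases are skipped."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be pre-aligned to the same length")
    diffs = {}
    for pos, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper())):
        if a != b and a in "ACGT" and b in "ACGT":
            diffs[pos] = (a, b)
    return diffs


def flag_candidate_snds(
    reported_snps: Dict[int, Iterable[str]], seq_a: str, seq_b: str
) -> List[int]:
    """Flag reported biallelic SNPs whose two alleles are exactly the two
    bases that distinguish the paralogues at that position."""
    diffs = paralogue_differences(seq_a, seq_b)
    return sorted(
        pos
        for pos, alleles in reported_snps.items()
        if pos in diffs and set(diffs[pos]) == set(alleles)
    )


# Toy data (hypothetical): a gene fragment and its paralogue.
gene = "ATGCCGTAAC"
paralogue = "ATGTCGTAGC"
# Hypothetical reported SNPs: position -> the two reported alleles.
reported = {3: {"C", "T"}, 5: {"G", "A"}, 8: {"A", "C"}}

print(flag_candidate_snds(reported, gene, paralogue))  # prints [3]
```

Only position 3 is flagged in this toy case: position 5 is identical in both copies, and the alleles reported at position 8 do not match the paralogue difference there. A flagged site would still require experimental confirmation, for example by sequencing, before being treated as an SND.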
