Identification of association between disease and multiple markers via sparse partial least-squares regression

Authors

  • Hyonho Chun,

    Corresponding author
    1. Department of Epidemiology and Public Health, Yale University, New Haven, Connecticut
    • Department of Epidemiology and Public Health, Yale University, 300 George Street #503, New Haven, CT 06511
  • David H. Ballard,

    1. Center for Genomics and Human Genetics, The Feinstein Institute for Medical Research, Manhasset, New York
  • Judy Cho,

    1. Department of Genetics, Yale University, New Haven, Connecticut
    2. IBD Center, Division of Gastroenterology, Department of Medicine, Yale University, New Haven, Connecticut
  • Hongyu Zhao

    1. Department of Epidemiology and Public Health, Yale University, New Haven, Connecticut
    2. Department of Genetics, Yale University, New Haven, Connecticut

Abstract

Although genome-wide association studies have led to the identification of hundreds of genes underlying dozens of traits in recent years, most published studies have relied primarily on single-marker analysis. Intuitively, more information can be used when multiple markers are analyzed jointly, and many methods have therefore been proposed for association analysis between traits and multiple markers. Among these methods, simulations and real data analyses have shown that it is often more effective to first reduce the dimensionality of the markers in a region through principal components analysis of all the markers, and then to test for association between traits and the principal components that account for most of the genetic variation in the region. However, one major limitation of this approach is that the principal components are derived purely from marker genotypes, without consideration of their relevance to traits. Furthermore, these components are constructed as linear combinations of all the markers, even when only a limited number are potentially relevant to traits. In this manuscript, we propose the use of sparse partial least-squares regression to derive components that are linear combinations of only the relevant markers. This approach uses information from both traits and marker genotypes. Extensive simulations and real data analyses on a Crohn's disease data set suggest the superiority of this approach over existing methods. Genet. Epidemiol. 35:479-486, 2011. © 2011 Wiley-Liss, Inc.
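
The contrast drawn in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the simulated genotypes, the effect size, the function names, and the tuning parameter `eta` are all assumptions made for illustration, and the sparse direction shown is only the simplified single-trait, first-component case (a soft-thresholded marker-trait covariance vector). The first function builds a component from marker genotypes alone; the second uses the trait and sets the weights of weakly associated markers to zero.

```python
import numpy as np


def first_pca_direction(X):
    """Leading principal component loadings; computed from genotypes only."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return vt[0]


def first_sparse_pls_direction(X, y, eta=0.5):
    """Simplified first sparse PLS direction for a single trait.

    Soft-thresholds the marker-trait covariance vector at
    eta * max|covariance| (0 <= eta < 1), then rescales to unit length.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    z = Xc.T @ yc                      # covariance of each marker with the trait
    thresh = eta * np.max(np.abs(z))   # assumed thresholding rule
    w = np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)
    norm = np.linalg.norm(w)
    return w / norm if norm > 0 else w


# Hypothetical data: 500 subjects, 20 markers coded 0/1/2,
# with only the marker at index 3 affecting the trait.
rng = np.random.default_rng(0)
X = rng.binomial(2, 0.3, size=(500, 20)).astype(float)
y = 1.0 * X[:, 3] + rng.normal(size=500)

v_pca = first_pca_direction(X)
w_spls = first_sparse_pls_direction(X, y, eta=0.5)

# Component scores that would then be tested for association with the trait.
score_pca = (X - X.mean(axis=0)) @ v_pca
score_spls = (X - X.mean(axis=0)) @ w_spls
```

In this toy setting the principal component generally puts some weight on every marker, whereas the thresholded direction keeps only markers whose covariance with the trait is large relative to the strongest one. Larger values of `eta` give sparser components.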
