Adjustment for Population Stratification via Principal Components in Association Analysis of Rare Variants


Correspondence to: Wei Pan, Division of Biostatistics, MMC 303, School of Public Health, University of Minnesota, Minneapolis, MN 55455-0392. E-mail:


For unrelated samples, principal component (PC) analysis has been established as a simple and effective approach to adjusting for population stratification in association analysis of common variants (CVs, with minor allele frequencies MAF > 5%). However, it is less clear how it would perform in analysis of low-frequency variants (LFVs, MAF between 1% and 5%), or of rare variants (RVs, MAF < 5%). Furthermore, with next-generation sequencing data, it is unknown whether PCs should be constructed based on CVs, LFVs, or RVs. In this study, we used the 1000 Genomes Project sequence data to explore the construction of PCs and their use in association analysis of LFVs or RVs for unrelated samples. It is shown that a few top PCs based on either CVs or LFVs could separate two continental groups, European and African samples, but those based on only RVs performed less well. When applied to several association tests in simulated data with population stratification, using PCs based on either CVs or LFVs was effective in controlling Type I error rates, while nonadjustment led to inflated Type I error rates. Perhaps the most interesting observation is that, although the PCs based on LFVs could better separate the two continental groups than those based on CVs, the use of the former could lead to overadjustment in the sense of substantial power loss in the absence of population stratification; in contrast, we did not see any problem with the use of the PCs based on CVs in all our examples.