• population substructure;
• outlier detection;
• GWAS;
• sequence data


For the analysis of rare-variant data in population-based designs, we propose a method to detect study subjects who may create population substructure in the study sample. Our approach is computationally fast and simple, permitting applications to whole-genome sequencing studies. The method does not require the variants to be in linkage equilibrium and can be applied to all the genetic loci that are available in the study. For both rare and common variants, we assess the performance of our approach by applying it to the 1000 Genomes Project data and in simulation studies. The results are compared to those of the commonly used outlier detection algorithm based on principal component analysis (PCA). The statistical power of the two approaches to detect outliers is comparable in most scenarios, but the power of PCA is lower than that of the novel approach in the presence of linkage disequilibrium and for subpopulations that are genetically similar. The data analysis and the simulation studies suggest that the two approaches differ in the number of false-positive results: our approach maintains the type I error rate, while the outlier detection approach based on PCA does not. Taking into account also the minimal computational requirements of our approach and its ability to incorporate all the marker information, the proposed method will have important applications in sequencing studies and genome-wide association studies.