• GWAS;
  • risk prediction;
  • discriminatory accuracy;
  • ROC curve;
  • AUC;
  • probability of correct classification


Genome-wide association studies (GWASs) have identified hundreds of single nucleotide polymorphisms (SNPs) associated with complex human diseases. However, risk prediction models based on them have limited discriminatory accuracy. It has been suggested that including many such SNPs can improve predictive performance. Here, we studied various aspects of model building to improve discriminatory accuracy, as measured by the area under the receiver operating characteristic curve (AUC), including: (1) How well does a one-phase procedure that selects SNPs and estimates odds ratios on the same data perform? (2) How should training data be allocated between SNP selection (Phase 1) and estimation (Phase 2) in a two-phase procedure? (3) Should SNP selection be based on P-value thresholding or ranking P-values? (4) How many SNPs should be selected? and (5) Is multivariate estimation preferred to univariate estimation in the presence of linkage disequilibrium (LD)? We used realistic estimates of the distributions of genetic effect sizes, allele frequencies, and LD patterns based on GWAS data for Crohn's disease and prostate cancer. Theory and simulations were used to estimate AUC. Empirical risk models based on 10,000 cases and controls had considerably lower AUC than theoretically achievable. The most critical aspect of prediction model building was initial SNP selection. The single-phase procedure achieved higher AUC than the two-phase procedure. Multivariate estimation did not perform as well as univariate (marginal) estimation. For complex diseases and samples of 10,000 or fewer cases and controls, one should limit the number of SNPs to tens or hundreds.