## Introduction

Genome-wide association studies (GWAS) have successfully identified many single nucleotide polymorphisms (SNPs) or candidate genes that are associated with complex diseases such as cancers. An important follow-up procedure is to characterize their effects on disease risk, including any interactions among them or with environmental exposures, and to identify undiscovered genes (Cordell, 2009; Manolio et al., 2009; Thomas et al., 2009; Cantor et al., 2010). Since complex diseases are caused by a network of interacting factors, the ideal analysis would simultaneously consider multiple loci, environmental factors, and their interactions. Such joint analyses would improve the power to detect causal effects and hence potentially lead to a better understanding of the genetics of diseases.

There are considerable challenges, however, in performing joint analyses of multiple genetic and environmental variables. First, with multiple SNPs and environmental factors, there are many possible main effects and interactions, most of which are likely to be zero or at least negligible, leading to high-dimensional sparse models. In addition, there are many more possible interactions than main effects, requiring different modeling and parameterization for main effects and interactions. Second, genetic association studies usually genotype SNPs in strong linkage disequilibrium (LD), introducing highly correlated variables. Third, SNP data often include genotypes with low frequencies that create predictors with near-zero variation. Finally, separation, which arises when a predictor or a linear combination of predictors is completely aligned with the outcome, is a common phenomenon in logistic regression with discrete predictors (e.g., Gelman et al., 2008). Because SNP data are discrete, separation can be a serious problem in case–control association studies. These and other complications create challenges in modeling, computation, and interpretation, and thus require sophisticated techniques.
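To make the separation problem concrete, the following is a minimal Python sketch, not taken from the paper. It assumes a one-parameter logistic model with no intercept, hypothetical toy data in which a coded genotype perfectly predicts case status, and, purely for simplicity, a normal prior on the coefficient. The function name `fit_logistic_1d` and the prior variance are illustrative assumptions.

```python
import math

def fit_logistic_1d(x, y, prior_var=None, n_iter=25):
    """Newton-Raphson for the one-parameter model logit P(y=1) = b * x.

    prior_var=None gives plain maximum likelihood; a number adds a
    N(0, prior_var) prior on b (a quadratic penalty), so the result is
    the posterior mode instead of the MLE.
    """
    b = 0.0
    for _ in range(n_iter):
        p = [1.0 / (1.0 + math.exp(-b * xi)) for xi in x]
        grad = sum(xi * (yi - pi) for xi, yi, pi in zip(x, y, p))
        hess = sum(xi * xi * pi * (1.0 - pi) for xi, pi in zip(x, p))
        if prior_var is not None:
            grad -= b / prior_var      # gradient of the log prior
            hess += 1.0 / prior_var    # curvature added by the prior
        b += grad / hess
    return b

# Hypothetical toy data: coded genotype x perfectly predicts case status y.
x = [-1, -1, 1, 1]
y = [0, 0, 1, 1]

b_mle_10 = fit_logistic_1d(x, y, n_iter=10)        # unpenalized, 10 steps
b_mle_25 = fit_logistic_1d(x, y, n_iter=25)        # unpenalized, 25 steps
b_map = fit_logistic_1d(x, y, prior_var=2.5 ** 2)  # with an assumed N(0, 2.5^2) prior
```

Under separation the unpenalized estimate keeps growing with every additional iteration because the likelihood has no finite maximizer, whereas the estimate under the prior settles at a finite value.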

One way to handle these problems is to adopt a Bayesian or penalized likelihood approach that uses appropriate prior information on coefficients to obtain stable, regularized estimates. Various prior distributions or penalties have been suggested. Park & Hastie (2008) propose logistic regression with quadratic penalization (i.e., normal prior) to fit gene–gene (G × G) and gene–environment (G × E) interactions. They show that the penalized logistic regression overcomes these problems and outperforms other popular methods such as multifactor dimensionality reduction (MDR) (Ritchie et al., 2001) and the tree-structured learning method FlexTree (Huang et al., 2004). Malo et al. (2008) apply ridge regression to fit all SNPs in a single genomic region and show that such multiple-SNP analyses accommodate LD among SNPs and have the potential to distinguish causative from noncausative variants. Several authors have made substantial progress in adapting alternative Bayesian or penalized high-dimensional models to genetic association studies (e.g., Tanck et al., 2006; Xu, 2007; Hoggart et al., 2008; Yi & Banerjee, 2009; Wu et al., 2009; Sun et al., 2010).

In this paper, we describe Bayesian generalized linear models (GLMs) for simultaneously analyzing environmental exposures, main effects of numerous loci, and G × G and G × E interactions. Following Gelman et al. (2008), we use the Student-*t* family as prior distributions on the coefficients, with the innovation that different hyperparameters are assigned to different types of effects (i.e., main effects, G × G and G × E interactions). This prior specification allows us to reliably estimate both main effects and interactions and hence to increase the power for detection of real signals. We show that our approach includes many previous methods as special cases and thus inherits their advantages. Similar to Gelman et al. (2008), we fit our Bayesian GLMs by incorporating an expectation–maximization (EM) algorithm into the usual iteratively weighted least squares as implemented in the general statistical package R. This strategy leads to stable and flexible computational tools and allows us to apply any GLM to genetic association studies. Our algorithm differs from that of Gelman et al. (2008) in treating variances rather than coefficients as missing data and hence avoids computationally intensive matrix calculations in the E-step. Another methodological contribution of this paper is a novel method to interpret and visualize models with multiple interactions by computing the average predictive probability. The proposed method has been incorporated into the freely available package R/qtlbim (Yandell et al., 2007).
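The fitting strategy outlined above can be illustrated with a short Python sketch. It is one plausible reading under stated assumptions, not the R/qtlbim implementation. It assumes a logistic model and the standard scale-mixture form of the Student-*t* prior, in which each coefficient is normal given its own variance and that variance has a scaled inverse-chi-square distribution with degrees of freedom `nu[j]` and scale `s[j]`. It also assumes an E-step that replaces each inverse variance by its conditional expectation `(nu[j] + 1) / (nu[j] * s[j]**2 + beta[j]**2)` and an M-step consisting of a single penalized weighted least squares update. The names `solve` and `em_iwls_logistic`, the toy data, and the hyperparameter values are all hypothetical.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def em_iwls_logistic(X, y, nu, s, n_iter=100):
    """Posterior-mode logistic regression with independent Student-t priors.

    X is a list of rows (first column 1 for the intercept). nu[j] and s[j]
    are the assumed prior degrees of freedom and scale of coefficient j;
    giving main effects and interactions different values only requires
    filling these two lists by effect type.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        # E-step: expected inverse prior variance given the current beta.
        d = [(nu[j] + 1.0) / (nu[j] * s[j] ** 2 + beta[j] ** 2) for j in range(p)]
        # M-step: one weighted least squares update with ridge penalty d.
        eta = [sum(xij * bj for xij, bj in zip(row, beta)) for row in X]
        prob = [1.0 / (1.0 + math.exp(-e)) for e in eta]
        w = [pi * (1.0 - pi) for pi in prob]
        A = [[sum(w[i] * X[i][j] * X[i][k] for i in range(n))
              + (d[j] if j == k else 0.0) for k in range(p)] for j in range(p)]
        # Right-hand side X'Wz written as X'(W eta + y - prob), so no division by w.
        rhs = [sum(X[i][j] * (w[i] * eta[i] + y[i] - prob[i]) for i in range(n))
               for j in range(p)]
        beta = solve(A, rhs)
    return beta

# Hypothetical toy data: intercept plus one binary predictor with quasi-separation.
X = [[1, 0], [1, 0], [1, 0], [1, 1], [1, 1], [1, 1]]
y = [0, 0, 1, 1, 1, 1]
beta_wide = em_iwls_logistic(X, y, nu=[1, 1], s=[10.0, 2.5])   # weaker shrinkage
beta_tight = em_iwls_logistic(X, y, nu=[1, 1], s=[10.0, 0.5])  # stronger shrinkage
```

Because the E-step in this sketch only updates one scalar per coefficient, the per-iteration cost is dominated by the same weighted least squares solve that an ordinary GLM fit already performs. A smaller prior scale pulls the corresponding coefficient more strongly toward zero.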

We demonstrate the effectiveness of our method in an application to a large case–control study of adiponectin genes and colorectal cancer risk described in Kaklamani et al. (2008). Current epidemiological evidence suggests an association between obesity, hyperinsulinemia, and colorectal cancer risk. Adiponectin is a hormone secreted by the adipose tissue, and its serum levels are inversely correlated with obesity and hyperinsulinemia. Kaklamani et al. (2008) is the first report of an association between variants of the adiponectin pathway and risk of colorectal cancer. However, their analyses fitted separate logistic models for each SNP and did not consider interactions. Our reanalysis highlights the previous results and detects new epistatic interactions and sex-specific effects that warrant follow-up in independent studies. We evaluate our method using extensive simulations based on the real genotypic data. The simulations show that our method has the potential to dissect interacting networks of complex diseases.