Bayesian Models for Detecting Epistatic Interactions from Genetic Data

Authors


Corresponding author: Jun S Liu, Department of Statistics, Harvard University, Science Center 715, 1 Oxford Street, Cambridge, MA 02138. Tel: 617-495-1600; Fax: 617-496-8057; E-mail: jliu@stat.harvard.edu

Summary

Current disease association studies are routinely conducted on a genome-wide scale, testing hundreds of thousands or millions of genetic markers. Besides detecting marginal associations of individual markers with the disease, it is also of interest to identify gene–gene and gene–environment interactions that confer susceptibility to the disease. The astronomical number of possible combinations of markers and environmental factors, however, makes interaction mapping a daunting task both computationally and statistically. In this paper, we review and discuss a set of Bayesian partition methods developed recently for mapping single-nucleotide polymorphisms in case-control studies, their extension to quantitative traits, and their further generalization to multiple traits. We use simulated and real data sets to demonstrate the performance of these methods, and we compare them with some existing interaction mapping algorithms. With recent advances in high-throughput sequencing technologies, genome-wide measurements of epigenetic factor enrichment, structural variations, and transcription activities have become available at the individual level. This tsunami of data creates more challenges for gene–gene interaction mapping, but at the same time provides new opportunities that, if utilized properly through sophisticated statistical means, can improve the power of mapping interactions at the genome scale.

Introduction

A main goal of genetic studies is to discover genetic variations underlying certain human traits. These traits can be categorical (e.g., whether or not a certain disease is present), ordinal, or continuous (quantitative). The data used to enable such discoveries are typically genotypes of a set of genetic markers of a selective sample of individuals. Pedigree-based linkage analyses refer to those genetic studies based on large pedigrees that are enriched with certain targeted trait(s). Population-based genetic association studies use unrelated individuals or individuals from many unrelated small families under a case-control setting. Thanks to the human genome project and its inspired efforts from the scientific community, we have now acquired a huge amount of information regarding the composition and variation of the human genome. The most abundant kind of variations in the human genome are the single-nucleotide polymorphisms (SNPs), which have now been used as genetic markers in virtually all case-control studies. With the advance of sequencing and genotyping technologies, it is now quite common to conduct case-control studies with hundreds of thousands of SNP markers genotyped for thousands of individuals.

In a typical genome-wide association study (GWAS), unrelated individuals among those who have the disease and those who do not have the disease are gathered and genotyped at a large number of SNP markers. Then, researchers attempt to find SNP markers whose genotype frequency distributions are significantly different between case and control individuals. It is also of interest to find multiple genetic loci whose interactions may significantly affect the disease risk, which we refer to as “epistasis.” Since the number of possible interaction combinations among the genotyped markers is astronomical, however, it is a daunting task to catch one or a very few disease-related interactions. Furthermore, since the number of genotyped individuals is quite limited compared to the genotyped markers, it is critically important to develop methods that are not only computationally feasible, but also statistically efficient so as to capture the likely weak effects.

It has been argued that more advanced methods for detecting epistasis may not be needed because (1) simpler methods are often enough to capture real genetic effects; (2) the main effects are often significant; and (3) replication is always needed to confirm any finding. These arguments do reflect one reality of current genetic research: a slight confounding effect can often throw off any delicate modeling effort. However, we feel that unless we have a powerful method for finding epistatic interactions, we will never know whether the above “reasons” reflect our ignorance or nature's rules. Researchers need to be somewhat forward looking, and we anticipate that good methods for discovering gene–gene and gene–environment interactions will become more and more important as more GWAS and genetic epidemiological studies are conducted. In this article, we review the Bayesian epistasis association mapping (BEAM) method of Zhang and Liu (2007), and explore its extensions to quantitative trait loci (QTL) mapping and gene expression trait (eQTL) mapping problems (Zhang et al., 2010).

QTL mapping generally refers to the task of identifying the genetic loci that are responsible for variation in a quantitative trait (such as the yield from a crop or the body fat mass of a mouse). Pioneering genetic mapping studies can be traced back more than 80 years, to when Sax (1923) showed that the association between seed weight and seed coat color in beans is due to the linkage between genes controlling weight and the genes controlling color. However, systematic and accurate mapping of QTLs was long impossible because of the difficulty in arranging crosses with genetic markers densely spaced throughout the whole genome. Advances in molecular genetics have since made it possible to genotype markers on the genome scale (Botstein et al., 1980). A large number of QTL mapping studies followed the advent of statistical methods (Paterson et al., 1988; Lander & Botstein, 1989) for experimental crosses, in which confounding effects are fully controlled so that phenotypic variations are attributed mainly to genetic factors.

More recently, researchers have begun to measure simultaneously both the gene expression microarray and genetic marker data for each individual and have treated gene expression values as quantitative traits so as to detect genomic regions associated with expression changes of a large number of genes. For example, Brem et al. (2002) detected eight eQTL hot spots in yeast that affect the expression of a group of 7 to 94 genes of related functions. An additional five hot spots were predicted using a larger sample size (Yvert et al., 2003). Such studies are termed eQTL studies and it is also of great interest to detect both epistatic interactions and pleiotropic effects (Brem et al., 2005; Zhang et al., 2010).

A common challenge for detecting epistasis in the three types of genetic data described above is the large number of genotyped SNPs and an astronomical number of possible interaction combinations to select from. Penalized regression approaches become ineffective in these settings due to both their unrealistic model assumption and their computational costs. Many novel methods for detecting epistasis have been proposed in the literature in recent years to overcome the limitations of the classical variable selection approach. In this article, we focus on a class of Bayesian models for epistasis detection developed over the past few years based on the BEAM model of Zhang and Liu (2007). We first review the basic formulation of BEAM and its algorithmic operations. We then extend the BEAM model to incorporate haplotype blocks of highly correlated SNPs. We further extend the BEAM model to the case with a single continuous trait, establishing a close connection with a recently proposed “partition retention” resampling method (Chernoff et al., 2009), and go on to explain the Bayesian partition (BP) method for eQTL studies. In the Examples section, we demonstrate the performances of these Bayesian methods using both real data and simulations, and we summarize our conclusions in the final section.

BEAM Model

The BEAM method (Zhang & Liu, 2007) is a Bayesian approach for detecting both single-locus disease associations and multilocus interactions in case-control studies. The basic rationale behind the BEAM model is that, if some SNPs are associated with the disease, the distribution of their genotypes (or alleles) should be different between cases and controls, otherwise there is no evidence of disease association at those SNPs. To distinguish between interactive and marginal associations of multiple SNPs, the BEAM model defines a set of SNPs to be interactive if the joint distribution of these SNPs fits the data better than the independence model (i.e., the product of their respective marginal distributions). Note that “interaction” is well defined only for those SNPs that are mutually independent (unlinked) a priori, such as SNPs located far apart in the genome. The BEAM algorithm assigns all SNPs into three nonoverlapping groups: group 1 contains SNPs that are marginally associated with the disease, group 2 contains SNPs jointly associated with the disease, and group 0 contains SNPs that are unrelated to the disease. A correct partition of SNPs into the three groups is of direct interest to an association study.

Let $Y_i$ denote the disease status and $X_{i1}, \ldots, X_{ip}$ denote the genotypes of a set of SNPs for individual $i$. Let $I = (I_1, \ldots, I_p)$ indicate the group membership of each marker so that $I_j = k$ if the $j$th marker is in group $k$. The BEAM model postulates that, for individual $i$,

$$P(X_i \mid Y_i, I) = \prod_{j:\, I_j = 0} P(X_{ij}) \prod_{j:\, I_j = 1} P(X_{ij} \mid Y_i) \; P(X_{i,(2)} \mid Y_i),$$

where we define $X_{i,(2)} = \{X_{ij} : I_j = 2\}$. Let $(X, Y)$ be the observed data with $N$ individuals including both treated and untreated. We have the following joint distribution:

$$P(X, I \mid Y) = P(I) \prod_{i=1}^{N} P(X_i \mid Y_i, I).$$

Note that the indicator vector $I$ is for mutation locations, and is the same for all individuals. $P(I)$ is our prior distribution on the indicator vector. In particular, each SNP has probability $p/2$ to be in group 1 (or group 2), and $1 - p$ to be in group 0. $P(I)$ is then a trinomial distribution. We let $P(X_{ij})$ be a multinomial distribution (it is independent of $Y_i$ because the variable is in group 0), denoted as Multinom($\theta_j$), with $\theta_j$ following a Dirichlet distribution a priori. Similarly, we let $P(X_{ij} \mid Y_i)$ be Multinom($\theta_j^{(Y_i)}$) with $\theta_j^{(Y_i)}$ following a Dirichlet prior, and let $P(X_{i,(2)} \mid Y_i)$ be Multinom($\Theta^{(Y_i)}$). For $Y_i = 1$ (treated), the dimension of $\Theta^{(1)}$ is equal to the cardinality of the support of $X_{i,(2)}$, and $\Theta^{(1)}$ follows a Dirichlet prior. For $Y_i = 0$ (untreated), we have $P(X_{i,(2)} \mid Y_i = 0) = \prod_{j:\, I_j = 2} P(X_{ij} \mid Y_i = 0)$, that is, an independence model.

More precisely, suppose that conditional on $I$, we have a set of markers $M$ that belong to group 2. There are $d = 3^{|M|}$ possible combinations of genotypes, where $|M|$ denotes the number of SNPs in set $M$. Let $X_M = (n_1, \ldots, n_d)$ denote the counts of genotype combinations in set $M$ observed from a sample of individuals $X$, which is assumed to follow the distribution Multinom($\Theta$). With a Dirichlet prior on $\Theta$ with parameters $(a_1, \ldots, a_d)$, we have

$$P(X_M) = \left[\prod_{i=1}^{d} \frac{\Gamma(n_i + a_i)}{\Gamma(a_i)}\right] \frac{\Gamma\left(\sum_{i=1}^{d} a_i\right)}{\Gamma\left(\sum_{i=1}^{d} (n_i + a_i)\right)}. \tag{1}$$

A proper choice of the Dirichlet parameter $a_i$ is 0.5 (Zhang & Liu, 2007), and $\Gamma(\cdot)$ is the gamma function.
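As a numerical illustration of the Dirichlet-multinomial marginal likelihood, the following minimal sketch evaluates it on the log scale and compares a joint model for two SNPs against the product of their marginals. It assumes a symmetric Dirichlet parameter of 0.5 for every cell; the function name and the toy count table are hypothetical, and this is not the BEAM implementation.

```python
import math

def log_dirichlet_multinomial(counts, a=0.5):
    """Log marginal probability of observations with the given cell counts,
    after integrating out the multinomial frequencies under a symmetric
    Dirichlet(a, ..., a) prior."""
    d = len(counts)
    n = sum(counts)
    log_p = math.lgamma(d * a) - math.lgamma(n + d * a)
    for n_i in counts:
        log_p += math.lgamma(n_i + a) - math.lgamma(a)
    return log_p

# Hypothetical 3 x 3 table of joint genotype counts for two SNPs among cases.
table = [[30, 2, 1],
         [2, 25, 3],
         [1, 3, 33]]
joint = [n for row in table for n in row]      # 9 genotype combinations
rows = [sum(r) for r in table]                 # marginal counts of SNP 1
cols = [sum(c) for c in zip(*table)]           # marginal counts of SNP 2

log_joint = log_dirichlet_multinomial(joint)
log_indep = log_dirichlet_multinomial(rows) + log_dirichlet_multinomial(cols)
# A positive difference favors the joint (interaction) model for this table.
print(log_joint - log_indep)
```

Working with `lgamma` rather than the gamma function itself avoids overflow for realistic sample sizes.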

The output of the BEAM algorithm is the posterior probabilities of each SNP belonging to group 0, 1, and 2, respectively. An SNP with greater posterior probability mass in group 1 or group 2 relative to group 0 indicates that the SNP is associated with the disease. By further comparing the posterior probability of each SNP between group 1 and group 2, one can infer if the SNP is marginally associated with the disease or may interact with some other SNPs.
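The posterior summaries described above can be sketched as simple frequencies over retained MCMC draws of the indicator vector. The function names, the 0.5 cutoff, and the toy draws below are illustrative assumptions, not part of the BEAM software.

```python
from collections import Counter

def group_posteriors(samples):
    """Estimate P(I_j = k | data) for k = 0, 1, 2 as sample frequencies.

    samples: list of indicator vectors, one per retained MCMC iteration,
             each a sequence with entries in {0, 1, 2}.
    """
    n_snps = len(samples[0])
    posteriors = []
    for j in range(n_snps):
        freq = Counter(draw[j] for draw in samples)
        posteriors.append([freq[k] / len(samples) for k in (0, 1, 2)])
    return posteriors

def call_snp(post, cutoff=0.5):
    """Label a SNP from its posterior vector (hypothetical decision rule)."""
    if post[1] + post[2] < cutoff:
        return "unassociated"
    return "marginal" if post[1] >= post[2] else "interaction"

# Four toy draws of I for three SNPs.
draws = [[0, 1, 2], [0, 1, 2], [0, 2, 2], [0, 1, 0]]
for j, post in enumerate(group_posteriors(draws)):
    print(j, post, call_snp(post))
```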

Extension of BEAM to Handle Haplotype Blocks

Compared with earlier methods developed for gene–gene interaction mapping, a clear advantage of BEAM is its mathematical simplicity (it can be viewed as a generalized naïve Bayes model) and its efficiency in large-scale disease association studies. The BEAM algorithm was also among the first to demonstrate the statistical and computational feasibility of detecting gene–gene interactions from ∼100,000 SNPs. A main limitation of BEAM, however, is that it treats the markers as independent in controls. It is well known that linkage disequilibrium (LD) between nearby SNPs exhibits block-like structures in the human genome (International HapMap Consortium, 2005), where SNPs within blocks are highly correlated with each other, and the correlation is broken down by recombination events at block boundaries. Although a first-order Markov chain is implemented in BEAM to account for correlations between adjacent SNPs, it is insufficient to capture the important block-like structures among densely genotyped SNPs.

Many computational methods have been proposed to infer the block structures among SNPs (Reich et al., 2001; Gabriel et al., 2002; Ding et al., 2005; Zhang et al., 2005). Those methods, however, often use ad hoc rules to define SNP blocks, and the resulting blocks may or may not be appropriate for association mapping. In addition, many regions in the human genome demonstrate vague structural patterns, for which a measure of block uncertainty is necessary and important. Previous studies showed that SNP-block structures could be affected by SNP density (Wang et al., 2002; Phillips et al., 2003; Wall & Pritchard, 2003), population structure (Wang et al., 2002; Stumpf & Goldstein, 2003; Zhang et al., 2003; Anderson & Slatkin, 2004), and gene conversion (Przeworski & Wall, 2001). As a result, using predefined block structures (such as those provided by the HapMap project or inferred beforehand by existing block partition methods) to model the SNPs in a case-control study may be suboptimal. It is therefore desirable to extend the BEAM method to account for the block-like structure among SNPs under a coherent Bayesian framework that simultaneously infers SNP-block structures and tests SNP associations based on the inferred blocks.

One approach to generalize the BEAM model to account for SNP-block structures is to introduce an SNP-block variable $B$ denoting the locations of block boundaries in the genome. A simple prior distribution of $B$ can be derived from a uniform distribution. That is, assuming that a block boundary occurs between any adjacent SNP pair with a constant probability (ignoring the distance between SNPs), the number of blocks then follows approximately a Poisson distribution, and the size of each block follows approximately an exponential distribution, a priori. Conditional on $B$, the $L$ SNPs in the data are partitioned into consecutive nonoverlapping blocks. Within each block, the SNPs are assumed to be strongly correlated, and between blocks the SNP correlation is much reduced. Suppose SNPs within an interval $[a, b)$ in the genome form a block. A simple model for the genotype data $X_{[a,b)}$ within the block can be a multinomial-Dirichlet distribution (1) on the counts of various genotype combinations of the SNPs within the block, written as $\Pr(X_{[a,b)})$. If we further assume independence between blocks, the genotype data of all SNPs, conditional on a block partition $B$, can be written as the product of probabilities of blocks

$$\Pr(X \mid B) = \prod_{[a,b) \in B} \Pr(X_{[a,b)}). \tag{2}$$
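The block-wise factorization can be sketched as follows: each block contributes a Dirichlet-multinomial marginal over its observed genotype combinations, and boundaries receive an independent Bernoulli prior between adjacent SNPs. The boundary probability `q`, the symmetric Dirichlet parameter 0.5, the function names, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter

def log_block_marginal(genotypes, start, end, prior=0.5):
    """Log Dirichlet-multinomial marginal for the SNPs in [start, end)."""
    counts = Counter(tuple(row[start:end]) for row in genotypes)
    d = 3 ** (end - start)          # number of possible genotype combinations
    n = len(genotypes)
    log_p = math.lgamma(d * prior) - math.lgamma(n + d * prior)
    for n_c in counts.values():     # unobserved combinations contribute zero
        log_p += math.lgamma(n_c + prior) - math.lgamma(prior)
    return log_p

def log_likelihood_given_blocks(genotypes, boundaries):
    """Sum of block log marginals; a boundary b separates SNP b-1 from SNP b."""
    n_snps = len(genotypes[0])
    edges = [0] + sorted(boundaries) + [n_snps]
    return sum(log_block_marginal(genotypes, lo, hi)
               for lo, hi in zip(edges, edges[1:]))

def log_boundary_prior(boundaries, n_snps, q=0.1):
    """Each of the n_snps - 1 gaps is a boundary with probability q."""
    k = len(boundaries)
    return k * math.log(q) + (n_snps - 1 - k) * math.log(1 - q)

# Toy data: SNPs 0 and 1 are perfectly correlated, SNP 2 is independent.
genotypes = [[g, g, h] for g in range(3) for h in range(3)] * 10
for boundaries in ([2], [1, 2]):
    score = (log_likelihood_given_blocks(genotypes, boundaries)
             + log_boundary_prior(boundaries, 3))
    print(boundaries, round(score, 2))
```

In this toy example the partition that keeps the two correlated SNPs in one block scores higher than the partition that splits every SNP into its own block.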

A distinctive feature of our Bayesian model of SNP blocks, as opposed to point estimates of block partitions, is that the block structural uncertainty is automatically accounted for by the posterior distribution of block boundary locations. It is especially useful in regions where the data do not demonstrate a clear block-like pattern among SNPs.

To further model disease associations and interactions based on a block partition B, the group membership variable I in BEAM can be used. Again, SNPs are partitioned into marginal association, interacting association, and no association groups, respectively. Unlike in BEAM, SNPs within a block are modeled as fully correlated, even if they belong to different groups. That is, the distribution of disease-unassociated SNPs (group 0) depends on the disease-associated SNPs (group 1 and group 2) within the block, and the distribution of marginally associated SNPs (group 1) depends on the interacting SNPs (group 2) within the block. Compared to the original BEAM algorithm, this extended model has an additional variable of interest, the block partition B of SNPs. The posterior distribution of both I and B can be efficiently learned using Markov chain Monte Carlo (MCMC) algorithms.

BP Modeling for Studying a Single QTL

Consider a sample with N individuals genotyped at M markers and measured with a quantitative trait. We denote individual i's quantitative trait as Yi, and genotypes of M genetic markers as Xi={Xi,j: j= 1, …, M}. The goal of QTL analyses is to discover genetic variations associated with the trait. Since some genetic variation may have a small or negligible marginal effect by itself, but have strong combinatorial effect together with other markers, we propose a model here to capture the joint effect of multiple markers.

Partition Retention

Let $I_j$ indicate whether marker $j$ has an influential effect on the trait value, and let $\Delta = \{j : I_j = 1\}$ with $|\Delta|$ denoting the size of $\Delta$. Then, the $N$ individuals can be partitioned into $K = 2^{|\Delta|}$ subsets according to different combinations of the markers in $\Delta$. Suppose the $k$th subset includes $n_k$ individuals with their corresponding trait values $\{Y_j : j \in \text{subset } k\}$. We denote the mean of the trait values for subset $k$, if $n_k > 0$, as $\bar{Y}_k = n_k^{-1} \sum_{j \in \text{subset } k} Y_j$, and the overall mean is $\bar{Y} = N^{-1} \sum_{j=1}^{N} Y_j$. Intuitively, if $\Delta$ includes informative markers, then the variance of trait values within each subset should be smaller than that between subsets. Because minimizing the sum of squared errors $\sum_{k} \sum_{j \in \text{subset } k} (Y_j - \bar{Y}_k)^2$ is equivalent to maximizing $V = \sum_{k} n_k (\bar{Y}_k - \bar{Y})^2$ (with optimization taken over $\Delta$), we might think that $V$ can be a criterion to be used here. However, this quantity always increases when more markers are added. A natural way to counter this effect is to add a penalty term as in regression problems, that is, to define

$$V_\lambda = \sum_{k} n_k (\bar{Y}_k - \bar{Y})^2 - \lambda \rho(\Delta), \tag{3}$$

where $\lambda$ can be chosen by cross-validation and $\rho(\cdot)$ can be any penalty function.

Based on some information-theoretic argument, Chernoff et al. (2009) introduced

$$H_C = \sum_{k:\, n_k > 0} n_k^2 (\bar{Y}_k - \bar{Y})^2. \tag{4}$$

This quantity increases when informative markers are included, but decreases when markers with “negligible” effects are added into Δ. Earlier, Lo and Zheng (2002) introduced the “backward haplotype transmission association” algorithm for detecting epistasis in case-control studies, and the method was generalized to “partition retention” in Chernoff et al. (2009). In essence, this method repeats the following process: a small subset (of size 8, say) of all the markers is chosen at random, then the variables in the subset are recursively considered for removal so as to maximally increase $H_C$ until no improvements can be obtained. The remaining markers are recorded. This process is repeated many times (say 100,000), and those markers that are retained with high frequency are likely to be informative. Although attractive, this method has a few weaknesses. First, there is no explicit model behind the criterion used; second, the resampling approach seems to be wasteful, especially when only a small fraction of markers are influential; third, there is no way to reflect model uncertainty and do model averaging.
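The screening loop described above can be sketched in a few lines. The score used here, a sum over partition cells of the squared cell size times the squared deviation of the cell mean from the overall mean, is an assumed form of the criterion; the function names, default settings, and toy data are hypothetical.

```python
import random
from collections import defaultdict

def score(X, y, subset):
    """Assumed criterion: sum over cells of n_k^2 * (cell mean - overall mean)^2."""
    y_bar = sum(y) / len(y)
    cells = defaultdict(list)
    for row, y_i in zip(X, y):
        cells[tuple(row[j] for j in subset)].append(y_i)
    return sum(len(v) ** 2 * (sum(v) / len(v) - y_bar) ** 2 for v in cells.values())

def backward_eliminate(X, y, subset):
    """Drop, one at a time, the marker whose removal most increases the score."""
    current = list(subset)
    best = score(X, y, current)
    while len(current) > 1:
        candidates = [(score(X, y, [j for j in current if j != drop]), drop)
                      for drop in current]
        new_score, drop = max(candidates)
        if new_score <= best:       # no removal improves the score: stop
            break
        current.remove(drop)
        best = new_score
    return current

def retention_frequencies(X, y, n_rounds=2000, subset_size=5, seed=0):
    """Count how often each marker survives elimination in random subsets."""
    rng = random.Random(seed)
    n_markers = len(X[0])
    freq = [0] * n_markers
    for _ in range(n_rounds):
        start = rng.sample(range(n_markers), subset_size)
        for j in backward_eliminate(X, y, start):
            freq[j] += 1
    return freq

# Toy check: marker 0 determines the trait, marker 1 is noise.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 2, 2]
print(backward_eliminate(X, y, [0, 1]))
```

Because every random subset retains at least one marker in this sketch, retention counts are only meaningful relative to one another.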

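As an alternative, we can directly model the distribution of $Y$ given the marker set $\Delta$. Suppose for $j \in$ subset $k$, we have $Y_j \sim N(\mu_k, \sigma^2)$, and assume that $\mu_k \sim N(0, \sigma^2/\tau)$ a priori. By integrating $\mu_k$ out, we have

$$P(Y \mid \Delta, \sigma^2) = (2\pi\sigma^2)^{-N/2} \left[\prod_{k:\, n_k > 0} \left(\frac{\tau}{n_k + \tau}\right)^{1/2}\right] \exp\left\{-\frac{S_\Delta^2}{2\sigma^2}\right\},$$

where $S_\Delta^2 = \sum_{k:\, n_k > 0} \left[\sum_{j \in \text{subset } k} Y_j^2 - \frac{n_k^2 \bar{Y}_k^2}{n_k + \tau}\right]$. Furthermore, we obtain the posterior distribution under the conjugate prior $\sigma^2 \sim \text{Inv-}\chi^2(\nu, s^2)$

$$P(I \mid Y) \propto \pi_0(I) \left[\prod_{k:\, n_k > 0} \left(\frac{\tau}{n_k + \tau}\right)^{1/2}\right] \left(\nu s^2 + S_\Delta^2\right)^{-(N+\nu)/2}, \tag{5}$$

where $n_k$ and $\bar{Y}_k$ are the size and the mean trait value of subset $k$ under the partition induced by $\Delta$, and $\pi_0(I)$ is the prior for $I$ (the same as the one specified in the Section entitled BEAM Model). The parameter $\tau$, as well as $\nu$ and $s^2$ for the inverse $\chi^2$ prior, are all treated as fixed tuning parameters here. The choice of these parameters may affect the power when the sample size is not large enough and when the assumed Gaussian prior for $Y$ deviates significantly from the real data. We always choose $\nu > 2$ so that the mean of $\sigma^2$ is finite. Since $\sigma^2$ is the within-group variance, we expect it to be smaller than the sample variance of $Y$. We thus choose $s^2$ so that the prior mean of $\sigma^2$ is equal to or slightly smaller than the sample variance of $Y$. The parameter $\tau/(1+\tau)$ is our prior on the percentage of variation explained by the partition.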
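The collapsed Gaussian partition posterior can be evaluated directly on the log scale. The sketch below assumes the form obtained by integrating out the subset means under a N(0, σ²/τ) prior and the variance under a scaled inverse chi-square prior, and returns the log posterior up to an additive constant; the function name, default tuning values, and toy data are assumptions.

```python
import math
from collections import defaultdict

def log_posterior_partition(X, y, subset, tau=1.0, nu=3.0, s2=1.0, log_prior=0.0):
    """Unnormalized log posterior of a marker set under the collapsed
    Gaussian partition model (subset means and variance integrated out)."""
    cells = defaultdict(list)
    for row, y_i in zip(X, y):
        cells[tuple(row[j] for j in subset)].append(y_i)
    log_p = log_prior
    resid = 0.0
    for values in cells.values():
        n_k = len(values)
        total = sum(values)                      # equals n_k * mean of cell k
        log_p += 0.5 * (math.log(tau) - math.log(n_k + tau))
        resid += sum(v * v for v in values) - total * total / (n_k + tau)
    log_p -= 0.5 * (len(y) + nu) * math.log(nu * s2 + resid)
    return log_p

# Toy data: marker 0 determines the trait, marker 1 is irrelevant.
X = [[0, 0], [0, 1], [1, 0], [1, 1]] * 5
y = [0, 0, 2, 2] * 5
for subset in ([], [0], [1], [0, 1]):
    print(subset, round(log_posterior_partition(X, y, subset), 2))
```

On this toy data the single informative marker scores above both the empty set and the larger set containing the irrelevant marker, which illustrates the built-in penalty on unnecessary partitioning.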

A Full BP Model: Two-Way Partitioning

Different genotype combinations for markers in Δ may be further grouped together if they have the same effects on the trait. The grouped genotype combinations can be modeled by a multinomial distribution as in the BEAM model. In other words, the individuals can be grouped/partitioned so that within each group the genotype combination of an individual follows a group-specific multinomial distribution. In this way, we may gain more power in detecting influential markers, and better understand the combinatorial effects of the markers. To achieve this end, we define a latent “individual type” variable Ji taking values from {1, …, G} for individual i, where G is unknown. Conditional on the individual type Ji, we assume that the joint distribution of Xi={Xi,j: j∈Δ} is multinomial and is independent of Yi. That is, we have

$$[X_i \mid J_i = l] \sim \text{Multinom}(1, \Theta_l), \quad \Theta_l = (\theta_{l1}, \ldots, \theta_{lK}),$$

where $K = 3^{|\Delta|}$, and $[Y_i \mid J_i = l] \sim N(\mu_l, \sigma^2)$. For markers $X_{\cdot,j}$ not in $\Delta$, we model them as another Multinom($1, p_j$). We can then derive a joint posterior distribution of $I$ and $J$ after incorporating a set of proper priors for the unknown parameters (e.g., Dirichlet priors for $\Theta_l$ and $p_j$) and integrating the parameters out, which is in the form of

$$P(I, J \mid X, Y) \propto P(Y \mid J) \, P(X \mid I, J) \, \pi_0(I) \, \pi_1(J), \tag{6}$$

where $P(X \mid I, J) = P(X_\Delta \mid I, J) \, P(X_{-\Delta} \mid I, J, X_\Delta)$. With Dirichlet priors assigned for $\Theta_l$ and $p_j$, both $P(X_\Delta \mid I, J)$ and $P(X_{-\Delta} \mid I, J, X_\Delta)$ follow Dirichlet compound multinomial distributions (also called multivariate Pólya distributions) after integrating the parameters out. $\pi_1(J)$ denotes the prior for the individual type vector $J$. A possible choice of the prior $\pi_1(J)$ is the Chinese restaurant process, which is related to the Dirichlet process.
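For the Chinese restaurant process prior on individual types, the log probability of a given labeling depends only on the group sizes. The sketch below uses the standard closed form for a partition of N items into G groups with concentration parameter α; the function name and the default α = 1 are assumptions.

```python
import math
from collections import Counter

def log_crp_prior(labels, alpha=1.0):
    """Log probability of the partition induced by `labels` under a Chinese
    restaurant process: alpha^G * prod (n_g - 1)! / prod_{i<N} (alpha + i)."""
    sizes = list(Counter(labels).values())
    n = len(labels)
    return (len(sizes) * math.log(alpha)
            + sum(math.lgamma(size) for size in sizes)   # log (n_g - 1)!
            + math.lgamma(alpha) - math.lgamma(alpha + n))

print(math.exp(log_crp_prior([1, 1])))     # both individuals share a type
print(math.exp(log_crp_prior([1, 2])))     # two singleton types
print(math.exp(log_crp_prior([1, 1, 1])))
```

Since the number of groups is not fixed in advance, this prior lets the number of individual types G be inferred together with the assignments.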

BP Modeling for eQTL Analyses

Studies in the genetics of gene expression combine gene expression and genotype data in segregating populations to detect loci linked to variations in RNA levels. These loci are referred to as expression quantitative trait loci (eQTL). To date, eQTL studies have been pursued in a number of species ranging from yeast to mouse and human (Brem et al., 2002; Schadt et al., 2003; Morley et al., 2004). A common theme of these studies is to treat thousands of gene expression values as quantitative traits and conduct QTL mapping for all of them. A distinctive feature of the eQTL problem compared with the single QTL study illustrated in the previous section is that we now have simultaneous measurements of many quantitative traits (tens of thousands). Abstractly, the problem has both a very large number of covariates (SNP markers) and a high dimensional response vector (gene expression values). It is desirable to account for both epistasis and pleiotropy.

Most eQTL studies are based on linear regression models (Lander & Botstein, 1989) in which each trait variable is regressed against each marker variable. The p-value of the regression slope is reported as a measure of significance for the association. In the context of multiple traits and markers, procedures such as false discovery rate (FDR) control (Storey et al., 2005) can be used to account for the large number of tests. To study multiple traits together in QTL mapping, Kim and Xing (2009) developed a graph-guided fused lasso (GFlasso). However, the GFlasso does not consider epistatic interactions in the model.
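As one concrete example of an FDR procedure applied to a list of trait-marker p-values, the sketch below implements the Benjamini-Hochberg step-up rule. This is a swapped-in, widely used procedure chosen for illustration, not the q-value method of Storey et al. (2005); the function name and example p-values are hypothetical.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return the indices of tests rejected by the Benjamini-Hochberg rule."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value falls under its threshold.
        if p_values[idx] <= fdr * rank / m:
            cutoff = rank
    return sorted(order[:cutoff])

p_values = [0.01, 0.04, 0.03, 0.5]
print(benjamini_hochberg(p_values, fdr=0.05))
print(benjamini_hochberg(p_values, fdr=0.10))
```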

To detect epistatic effects, a two-step model is generally used (Storey et al., 2005; Zou & Zeng, 2009) assuming that at least one locus has a strong marginal effect. Model selection methods were proposed for detecting epistatic effects without main marginal effects (Yi et al., 2007; Manichaikul et al., 2009). However, they are limited to a few clinical traits instead of thousands of expression traits due to computational constraints.

A BP model was proposed (Zhang et al., 2010) to address the issue of multiple traits and multiple markers by modeling the joint distribution of all genes and all markers simultaneously. Under a Bayesian framework, three sets of latent indicator variables for genes, markers, and individuals, respectively, are introduced and then the association between groups of genes and sets of markers are systematically inferred. Here, we briefly review the BP model.

Assuming there are N individuals in a data set, each individual i is measured with G gene expression values denoted as {yig: g= 1, …, G} and M marker genotypes denoted as {xim: m= 1, …, M}. We call a set of correlated expressions and their associated set of SNP markers a module and assume that the observed data can be partitioned into D nontrivial modules plus a null component. The number of nontrivial modules, D, is set in a range based on preliminary results from a single-trait single-marker regression model. Every gene g or marker m belongs to one of the D nontrivial modules or the null module, determined by the gene indicator Ig∈{0, 1, …, D} and the marker indicator Jm∈{0, 1, …, D}. For each module d∈{1, …, D}, the N individuals are further partitioned into nTd types denoted by the individual indicators Kdi∈{1, …, nTd} for i∈{1, …, N}. Each module may have a different number of individual types as well as different ways of partitioning the N individuals. Note that there could be much fewer individual types than the possible number of genotype combinations.

The gene expression traits in module $d$ can be modeled as the sum of the gene effect ($\alpha_g$), the eQTL effect for individual type $k$ ($\mu_k$), the individual effect ($r_i$), and an error term

$$y_{ig} = \alpha_g + \mu_k + r_i + \varepsilon_{ig},$$

where gene $g$ is in module $d$, $k$ is the individual type of $i$, and $r_i$ and $\alpha_g$ are random effects, following independent Gaussian distributions with mean zero.
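To make the additive structure of the module model concrete, the following sketch simulates expression values for one module as the sum of a gene effect, an effect for the individual's type, an individual effect, and noise. The standard deviations, the function name, and the argument names are illustrative assumptions and do not come from the BP software.

```python
import random

def simulate_module(n_individuals, n_genes, type_of, type_effects,
                    sd_gene=1.0, sd_ind=0.5, sd_err=0.5, seed=0):
    """Simulate y[i][g] = gene effect + type effect + individual effect + error.

    type_of:      individual type index for each individual
    type_effects: eQTL effect attached to each individual type
    """
    rng = random.Random(seed)
    gene_effect = [rng.gauss(0, sd_gene) for _ in range(n_genes)]
    ind_effect = [rng.gauss(0, sd_ind) for _ in range(n_individuals)]
    return [[gene_effect[g] + type_effects[type_of[i]] + ind_effect[i]
             + rng.gauss(0, sd_err)
             for g in range(n_genes)]
            for i in range(n_individuals)]

# Six individuals of two types, four genes in the module.
y = simulate_module(6, 4, type_of=[0, 0, 0, 1, 1, 1], type_effects=[0.0, 2.0])
print(len(y), len(y[0]))
```

Because all genes in the module share the same type effect, averaging across genes within an individual sharpens the contrast between individual types, which is the intuition behind aggregating information over a module.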

To model epistasis of the associated markers of module d, we assume that within each individual type k, the genotype vector inline image follows a type-specific multinomial distribution: inline image To avoid overfitting, an exponential prior on the indicator variables is introduced to penalize partitions with high complexity

image

where ngd, nmd, nTd are the number of genes, markers, and individual types in module d, and L is the number of genotypes at each marker. The joint posterior distribution of all unknown variables can be derived as

image

where β represents the set of remaining continuous parameters that cannot be integrated out analytically. To make inference on the eQTL modules from this posterior distribution, we construct an MCMC method to traverse the joint space of all unknown parameters. Each Markov chain is randomly initialized, and uses the Gibbs sampler and the Metropolis-Hastings algorithm (Liu, 2001) to update the variables. We implement a split-merge algorithm, which is a special case of the reversible jump MCMC (Green, 1995), to update the individual partitions globally. Parallel tempering (Geyer, 1991) is used to improve the mixing of the Markov chains.
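The role of parallel tempering can be illustrated on a toy problem. The sketch below runs random-walk Metropolis chains at several temperatures on a one-dimensional bimodal target and proposes swaps between adjacent temperatures; the target, temperature ladder, step size, and function names are all assumptions, and the sampler is far simpler than the Gibbs, Metropolis-Hastings, and split-merge moves used for the BP model.

```python
import math
import random

def log_target(x):
    """Toy bimodal log density: equal mixture of N(-4, 1) and N(4, 1), up to a constant."""
    a = -0.5 * (x + 4) ** 2
    b = -0.5 * (x - 4) ** 2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def parallel_tempering(n_iter=5000, temps=(1.0, 2.0, 4.0, 8.0), step=1.0, seed=0):
    rng = random.Random(seed)
    xs = [rng.uniform(-1, 1) for _ in temps]
    cold = []
    for _ in range(n_iter):
        # Metropolis update within each chain, targeting target(x)^(1/T).
        for c, t in enumerate(temps):
            prop = xs[c] + rng.gauss(0, step)
            log_ratio = (log_target(prop) - log_target(xs[c])) / t
            if rng.random() < math.exp(min(0.0, log_ratio)):
                xs[c] = prop
        # Propose exchanging the states of a random pair of adjacent temperatures.
        c = rng.randrange(len(temps) - 1)
        log_ratio = ((log_target(xs[c + 1]) - log_target(xs[c]))
                     * (1 / temps[c] - 1 / temps[c + 1]))
        if rng.random() < math.exp(min(0.0, log_ratio)):
            xs[c], xs[c + 1] = xs[c + 1], xs[c]
        cold.append(xs[0])          # only the T = 1 chain targets the posterior
    return cold

samples = parallel_tempering()
print(min(samples), max(samples))
```

Hot chains move between modes more freely, and accepted swaps pass those states down to the cold chain, which is what helps a sticky sampler escape a local mode.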

Examples

An Application of the BEAM Algorithm

We tested BEAM on Crohn's disease data extracted from the Wellcome Trust Case-Control Consortium (WTCCC). The data contained SNPs from two genomic intervals, one on chromosome 3 between 139.3 and 144.3 Mb and one on chromosome 5 between 89.7 and 94.7 Mb. The data contained genotypes of 2005 cases and 3004 controls at 1182 SNPs, spanning a total region of 10 Mb. Nonpolymorphic SNPs and SNPs with more than 5% low-quality genotype scores (<0.95) were filtered out. We tested BEAM on these data because there existed an interaction association between SNP rs1427776 and rs13358266, each located in one of the two intervals, respectively. The unadjusted p-values of marginal association of the two SNPs are 5.2e-4 and 5.6e-3, respectively, and the unadjusted p-value of joint association of the two SNPs is 1.5e-9. We ran BEAM on these data for 100,000 burn-in and 100,000 sampling iterations. As shown in Figure 1, BEAM successfully identified the two SNPs rs1427776 and rs13358266 as being associated with Crohn's disease through interactions (solid vertical lines in Fig. 1).

Figure 1.

Posterior probability of marginal association (dashed vertical line) and interaction association (solid vertical line) at each SNP, estimated by BEAM from the WTCCC Crohn's disease data.

A Simulation Study for the Single QTL Partition Model

We simulated N= 50 haploid individuals (each marker has two values) from model

image

with $\varepsilon \sim N(0, 1)$ and $\beta = 4$. A total of 200 markers (including the three associated markers) were generated assuming independence. We first tested the partition retention method of Chernoff et al. (2009) with criterion (4). We randomly selected 1000 subsets, each with five markers. For each subset, we recursively eliminated a marker from the set until (4) decreased by an amount larger than a cutoff value (Fig. 2). With cutoff = 0, we only observed the marginal effect of X1, and the combinatorial effect of X2 and X3 was overwhelmed by the noise. With cutoff = 100, selected after repeated experimentation, we were able to pick up the weak interactive effect. The marginal effect of X1 and the combinatorial effect of X2 and X3 were set to be the same, but it was much more frequent for the partition retention method to pick up X3.

Figure 2.

Frequencies of retention of each variable by using criterion (4) and the partition retention method. We tested two ways of recursive variable deletion.

As a comparison, we tested the two Bayesian methods, simulating I via MCMC according to the posterior distribution (5) derived from an explicit Gaussian model, and from the full Bayesian posterior (6). Both Bayesian models found the truly informative markers with ease, and Figure 3 shows the result for the latter simulation. This method not only recovered all the influential variables but also inferred the individual types correctly. For this model, the MCMC sampler was very sticky. We implemented a parallel sampling strategy and also used the simulation results from (4) as starting configurations. More efficient sampling strategies are needed to deal with problems of larger sizes.

Figure 3.

MCMC results under the full Bayesian partition model (6) (lower right: posterior probability of individual types vs. true individual types [jittered] for each individual).

These simulations suggest that the Bayesian methods work better than the partition retention method, especially for weak effects. The advantage of the partition retention method is its robustness against the scoring function. In other words, because it repeatedly “catches” and “releases” influential variables, it cannot be “locked” into a local mode, which is a problem for any mode-finding or MCMC-based algorithm. But this is also its weakness. If there are only very few influential variables (say 4) out of a large pool (say 1000), the chance for a randomly selected screening subset to include informative variables is small, and the chance for a subset to include the set of variables that interactively influence the trait value is astronomically small. Furthermore, one needs to repeatedly catch these variables in the initial screening subset in order to retain them with a reasonable frequency. We indeed observed these weaknesses in our preliminary simulation studies.

A Simulation Study of the BP eQTL Model

To test the effectiveness of our BP eQTL model, we simulated 120 individuals with 500 binary markers and 1000 expression traits in the context of an inbred cross of haploid strains. There are eight modules (summarized in Table 1), each consisting of 40 genes, simulated from different epistasis models based on the linear regression framework, which is different from the posited Bayesian model in our analysis. We repeated the simulation 100 times and analyzed the simulated data using two methods: (1) the BP method using parallel tempering (Liu, 2001) with a ladder of 15 temperatures, referred to as BP; (2) the two-stage regression method of Storey et al. (2005), referred to as SR. As shown by the receiver operating characteristic curves in Figure 4, BP achieved a significantly higher power to detect eQTLs compared to SR. There are likely two reasons for this. First, we modeled the coregulated genes as a module so that information from all genes in a given module could be aggregated to improve the signal. Multiple-trait mapping has proven to be more powerful than single-trait mapping (Jiang & Zeng, 1995) in the regression framework. Second, epistatic interactions were explicitly modeled so that markers with weak marginal but strong interactive effects could be detected.

Table 1.  Simulation Design and Genetic Variance Decomposition (Summary of the Design for Simulation Studies)

Module  Model (1)                                      % of Var. (2)  Locus 1 (3)  Locus 2 (4)  Epistasis (5)
A       R I{x1=1 or x2=1} + e                          0.153          0.338        0.339        0.333
B       R I{x1=x2} + e                                 0.158          0.052        0.052        0.895
C       R = 2βI{x1=1 or x2=1} + β(x1*x2) + e           0.160          0.466        0.441        0.088
D       R I{x1=0,x2=1} + 2βI{x1=1,x2=0} + e            0.161          0.133        0.128        0.739
E       R x1 + β(x1*x2) + e                            0.132          0.748        0.138        0.128
F       R = 2βx1x2 + e                                 0.169          0.736        0.231        0.043
G       R = 2βx1 I{x1=x2} + e                          0.168          0.743        0.050        0.211
H       R = 2βI{01} + 1.5βI{10} + 0.5βI{11} + e        0.168          0.131        0.048        0.821

(1) The regression model that was used to generate the “core gene” in each module.
(2) The average percentage of variation of genes in the module explained by the true model.
(3) The average percentage of genetic variance explained by the first locus.
(4) The average percentage of genetic variance explained by the second locus.
(5) The average percentage of genetic variance explained by epistasis.
In all modules, the heritability of the “core gene” is 0.6 and the average correlation of the module genes with the “core gene” is 0.5.

Figure 4.

Comparison of the receiver operating characteristic (ROC) curves for the gene-marker pair detection obtained by our Bayesian partition method (BP) and the two-stage regression method (SR). Different points along the ROC curves represent the false positive and true positive counts averaged over 100 simulations at different posterior probability thresholds (for BP) or at different FDR thresholds (for SR). There are 40 genes in each of the 8 modules, each of which is linked to 2 markers, and thus the total number of true positive gene-marker pairs is 640.
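The points on such a curve can be obtained by counting false and true positives among the gene-marker pairs called at each threshold. The sketch below shows this counting for a single simulated data set; the scores and truth labels are made-up toy values, and `roc_points` is a hypothetical helper rather than code from the BP software.

```python
def roc_points(scores, is_true, thresholds):
    """Return (threshold, false_positives, true_positives) for each threshold.

    A gene-marker pair is called when its score (e.g., a posterior
    probability of association) is at least the threshold.
    """
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, is_true) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, is_true) if s >= t and not y)
        points.append((t, fp, tp))
    return points


# Toy, made-up posterior probabilities for six gene-marker pairs.
scores = [0.95, 0.80, 0.60, 0.40, 0.20, 0.05]
is_true = [True, True, False, True, False, False]
for t, fp, tp in roc_points(scores, is_true, [0.9, 0.5, 0.1]):
    print(t, fp, tp)
```

Averaging the false positive and true positive counts at each threshold over replicate simulations would then give one point per threshold, as in the curves described above.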

When examining each module individually, the performance difference between the two methods is most prominent when the marginal effect is weak. For example, in modules B, D, and H, the rate of true positive detections of SR never exceeded 5% even at the generous FDR threshold of 90%. In modules E, F, and G, where the major marker explains more than 70% of the genetic variation, SR detected the major marker in nearly 50% of the simulations at the 50% FDR threshold, but not the minor marker. In contrast, BP performed well and robustly in all eight modules.

An Application to Yeast Segregant Data

The BP method was also applied to a data set consisting of gene expression and genotypes for 112 segregants from a cross between laboratory (BY) and wild (RM) strains of S. cerevisiae (Brem & Kruglyak, 2005), where it detected 29 modules of genes and their associated markers. Among these 29 modules, 20 are linked to a single eQTL, while the remaining nine are linked to two eQTLs. Three of the nine modules linked to two eQTLs show significant epistatic interactions between the two loci. Similar to the simulation result, the BP method identifies significantly more weak gene-marker associations than the simple “single-transcript-single-marker” method. Focusing on modules with significant epistatic interactions, two previously reported epistatic interactions, one between MAT at the chromosome III locus and GPA1 at the chromosome VIII locus (Brem et al., 2005) and one between two loci on chromosomes XIII and X (Lee et al., 2006), were recapitulated in two modules that the BP method identified. In addition, the BP method identified a module, Module 3, which consists of many daughter cell-expressed genes, is linked to two eQTLs with a significant epistatic interaction, and is predicted to be under the regulation of AMN1 and BPH1, which are located near the two eQTLs, respectively.

Module 3 comprises genes and has the second most significant interaction term (p-value = 6.63e-5). This module is linked to chromosomes II: 548401 and III: 177850. Binding sites for ACE2, a transcription factor that activates expression of early G1-specific genes and that localizes to daughter cell nuclei after cytokinesis, are enriched in this module (p-value = 2.46e-5). AMN1, which encodes a protein required for daughter cell separation and multiple mitotic checkpoints, is the only gene with a cis-eQTL in the module, and is predicted as at least one of the putative regulators for the eQTL hot spot at the chromosome II locus (Yvert et al., 2003; Zhu et al., 2008). The AMN1 allele swap signature (Zhu et al., 2008) overlaps significantly with this module (p-value = 1.77e-11). In addition, of the ten daughter-specific expression (DSE) genes identified in culture-averaged microarray experiments (Colman-Lerner et al., 2001), nine are in our study set and seven of these are included in this module (p-value = 4.97e-12). At the chromosome III locus is BPH1, a gene involved in cell wall organization. The RM version of BPH1 has a deletion in the middle of the coding sequence compared to the BY sequence, which results in an in-frame stop. Therefore, the RM version of BPH1 may not be functional. When BPH1 is knocked out, sporulation decreases (Enyenihi & Saunders, 2003). However, we note that BPH1 is in the null module, suggesting that the BPH1 activity instead of its expression level may be linked to this locus.

To show that Module 3 is under the regulation of two loci, we examined the expression of two genes in the module, DSE1 and DSE2. DSE1 and DSE2 are upregulated 15.1- and 20.4-fold, respectively, in segregants carrying the BY allele at the AMN1 locus relative to those carrying the RM allele. If we restrict attention to those segregants carrying the BY allele at the BPH1 locus, DSE1 and DSE2 are upregulated 13.8- and 16.9-fold, respectively, in segregants carrying the BY allele at the AMN1 locus relative to those carrying the RM allele. When the RM version of AMN1 was introduced onto the BY background, DSE1 and DSE2 were upregulated only 9.7- and 13.5-fold in the BY wild type compared to the BY-engineered strain (Ronald et al., 2005). These results combined suggest that AMN1 alone cannot explain all of the variation in DSE1 and DSE2 expression, but the combination of the AMN1 and BPH1 alleles explains significantly more of the variation (shown in Fig. 5).

Figure 5.

Comparison of the expression of DSE1 and DSE2 in different experiments. DSE1 and DSE2 are two daughter cell-specific genes in Module 3. DSE1 and DSE2 are upregulated 15.1- and 20.4-fold, respectively, in segregants bearing the BY allele at AMN1 compared to segregants bearing the RM allele at AMN1 (white bars). DSE1 and DSE2 are upregulated 13.8- and 16.9-fold, respectively, in segregants bearing the BY allele at AMN1 and the BY allele at BPH1 compared to segregants bearing the RM allele at AMN1 and the BY allele at BPH1 (grey bars). DSE1 and DSE2 are upregulated 9.7- and 15.3-fold, respectively, in the original BY strain relative to the engineered BY strain with the RM allele at AMN1 (Ronald et al., 2005) (black bars). It is clear that segregants categorized by both AMN1 and BPH1 alleles match the experimental result better.

Conclusions

Large-scale interaction association mapping is an extremely challenging problem in statistics, in which conventional methods (such as regression model selection approaches) are often inadequate in terms of both power and computational efficiency. The reviewed BP algorithms provide a new perspective on solving the problem, which may help overcome some drawbacks of conventional approaches. First, although the models appear complex, the computational complexities of the discussed Bayesian methods are relatively low, with most model parameters analytically integrated out under conjugate priors. Second, the BP methods utilize advanced MCMC algorithms to effectively explore the huge space of possible interaction associations. Exhaustive enumeration of two-way interactions may be feasible for hundreds of thousands of variables given current computing power. It is, however, prohibitive to enumerate higher-order interactions. Also, as we demonstrated in simulation studies, some conventional approaches that perform screening from a large pool of variables can easily miss important variables that are involved in interactions. Third, the reviewed Bayesian models use regularized probability functions to evaluate various associations and interactions of variables, relieving users of the tedious and sometimes ad hoc selection of model penalty parameters. We empirically observed that the BP models have greater power than some existing methods, while maintaining good specificity. Fourth, the prior distribution of association variables in the reviewed models can be easily modified to take account of prior knowledge of disease-related genes, pathways, regulatory elements, and information about how they might interact to influence phenotypes. With advanced high-throughput genotyping and sequencing technology, genome-wide association studies will reveal more of the genetic architecture underlying complex diseases. A challenge in statistics is how to efficiently analyze the genome-wide data and properly integrate information in both genetics and genomics to facilitate future biological discovery.
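The contrast between two-way and higher-order enumeration can be made concrete with a simple count of marker combinations. The sketch below assumes 500,000 markers as an illustrative stand-in for “hundreds of thousands”; `n_interactions` is a hypothetical helper name.

```python
from math import comb


def n_interactions(n_markers, order):
    """Number of distinct marker sets of the given interaction order."""
    return comb(n_markers, order)


n_markers = 500_000  # illustrative marker count, not taken from a specific study
for order in (2, 3, 4):
    print(order, f"{n_interactions(n_markers, order):.3e}")
```

Under this assumption there are about 1.25 × 10^11 marker pairs but about 2.08 × 10^16 triples, so each additional interaction order multiplies the search space by roughly five orders of magnitude.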

URLs: BEAM: http://stat.psu.edu/~yuzhang/

The programs for the models extended from BEAM were not available at the time this paper was prepared.

Acknowledgements

YZ was supported by grants from the NIH (R01 HG004718-03). JZ was supported by grants from the NCI (1U54CA149237) and Washington Life Science Discovery Fund (3104672). JSL was supported in part by grants from the NIH (R01-HG02518) and NSF (DMS-0706989).
