Detecting Epistatic SNPs Associated with Complex Diseases via a Bayesian Classification Tree Search Method

Authors


Corresponding author: Hongyu Zhao, Department of Epidemiology and Public Health, Yale University, 60 College Street, New Haven, Connecticut 06520-8034. Tel: (203) 785-6271; Fax: (203) 785-6912; E-mail: hongyu.zhao@yale.edu.

Summary

Complex phenotypes are known to be associated with interactions among genetic factors. A growing body of evidence suggests that gene–gene interactions contribute to many common human diseases. Identifying potential interactions of multiple polymorphisms thus may be important to understand the biology and biochemical processes of the disease etiology. However, despite the great success of genome-wide association studies that mostly focus on single locus analysis, it is challenging to detect these interactions, especially when the marginal effects of the susceptible loci are weak and/or they involve several genetic factors. Here we describe a Bayesian classification tree model to detect such interactions in case-control association studies. We show that this method has the potential to uncover interactions involving polymorphisms showing weak to moderate marginal effects as well as multi-factorial interactions involving more than two loci.

Introduction

To identify genetic variants that are responsible for complex human diseases, genome-wide association studies (GWAS) scan single nucleotide polymorphisms (SNPs) across the genomes of individuals in case and control groups. Most GWAS focus on single-locus analysis, testing the main effect of one SNP at a time. However, single-locus analysis may fail to detect SNPs that have no significant marginal effect individually but a strong effect collectively (Culverhouse et al. 2002). Genes are known to interact with one another in biological processes such as gene regulation, metabolism, signal transduction, and various development-related pathways, so genetic variants at multiple genomic loci may jointly contribute to complex phenotypes (Moore 2003). Many complex diseases, such as sporadic Alzheimer's disease (Combarros et al. 2009), type 2 diabetes (Wiltshire et al. 2006), and breast cancer (Ritchie et al. 2001), have been reported to be associated with interactions of multiple polymorphisms. The phenomenon in which the effect of a variant in one gene depends on variants at other genomic loci is known as epistasis. Despite their potential importance for uncovering disease etiology, epistatic effects are difficult to identify in genome-wide settings. To address this challenge, many statistical methods have been proposed; a recent review surveys these methods and the related software packages (Cordell 2009). According to that review, the methods include exhaustive search algorithms, data-mining and machine-learning approaches, and Bayesian model selection methods.

Among the class of machine learning methods is recursive partitioning that produces tree-structure models (Breiman et al. 1984; Zhang & Bonney 2000; Nelson et al. 2001; Cook et al. 2004). Figure 1 depicts an example of a tree model. In tree models, each nonterminal node defines a splitting rule based on a predictor variable. A path from the top node to each terminal node corresponds to a unique mapping from the predictor space to a specific outcome, depending on the values of all predictor variables along that path. Therefore, each terminal node represents a particular combination of values for all variables on the path, and thus naturally allows epistatic effects of those variables in the model. In addition, due to the fact that there can be multiple levels of nodes involving two or more variables, tree based models also allow detection of multi-way interactions. Since the partition of the predictor space is constructed in a recursive manner, the splitting of a variable is conditional on the values of other variables in its ancestral nodes in the tree.

Figure 1.

An example of a classification tree. The value shown in each node is the proportion of cases; the fraction n1/n is the misclassification rate, where n is the total number of individuals in the node and n1 is the number of misclassified ones.

Tree space is commonly searched with a greedy algorithm: at each node, the splitting variable and its partition rule are chosen, from the pool of all available variables and splitting values, to maximize the separation of the resulting partition. Algorithms of this type may therefore fail to identify interactions among variables that do not display substantial marginal effects (Cordell 2009). To alleviate this problem, Bayesian algorithms were proposed that stochastically search for promising classification trees using Markov chain Monte Carlo (MCMC) (Chipman et al. 1998; Denison et al. 1998). Bayesian methods for detecting epistatic associations have also been proposed by several authors (Lunn et al. 2006; Zhang & Liu 2007). Bayesian classification trees are closely related to Bayesian model selection: a prior is assigned to all tree models, and it serves to control the sizes of the trees. One advantage of such a prior is that it lets a variable split with a certain probability even if the variable does not exhibit a strong marginal effect. As a result, the method may increase the probability of finding epistatic effects whose marginal effects are weak. The MCMC algorithm is also adaptive, in that it tends to search more thoroughly in the vicinity of trees containing interacting variables found in previous iterations, which allows the detection of multi-way interactions. This feature distinguishes it from methods based on ensembles of trees, such as random forests (Breiman 2001; Lunetta et al. 2004; Bureau et al. 2005), in which trees are constructed independently and so are ‘memoryless’ of promising trees visited previously. As a result, potentially important interactions may be diluted in an ensemble consisting of a large number of trees, making it difficult to uncover possible multi-way interactions.

In the next section, we provide a detailed description of binary classification trees and the Bayesian treatment of model search, followed by illustrations of the approach through simulation studies and a real data example.

Materials and Methods

Binary Classification Trees

There are two types of nodes in a binary classification tree—internal nodes represented by ovals and terminal nodes represented by rectangles as shown in Figure 1. Each internal node has an associated splitting rule and exactly two offspring called child nodes. The splitting rule uses a feature or variable, like the genotype of a SNP or the age of an individual, to assign an observation to either the left or right child nodes. The classification process starts from the top node that is called the root node. At each internal node, an observation is classified to one of the two child nodes according to its feature value and the splitting rule. After moving down along the branches of the tree, the observation finally reaches one of the terminal nodes. Therefore, all terminal nodes represent a partition of the feature space. A general principle of the partitioning process is to make individuals in a terminal node as homogeneous as possible in terms of the outcome, while different terminal nodes are heterogeneous. For instance, in the example shown in Figure 1, there are 500 cases and 500 controls and two SNPs X1 and X2, each taking one of the values 0, 1 and 2 corresponding to the three genotypes. The rule in the root node classifies individuals with X2∈{1, 2} to the left child and all others to the right child, which is a terminal node. After this step, terminal node E contains 280 individuals who are all controls. Thus the misclassification rate at this node is 0/280. For those in the left child node, 500 are cases while the remaining 220 are controls, which are further split based on their X1 values. The partitions take place iteratively and each individual eventually reaches one of the five terminal nodes.

Note that a classification tree can naturally represent epistatic interactions among features. For example, terminal nodes C and D represent the interactions X1∈{0, 1} ∩ X2∈{2} and X1∈{0, 1} ∩ X2∈{1}, respectively. This shows that the effect of X1 depends on the genotype of X2: individuals with X1∈{0, 1} have low risk when X2=1, but their risk is very high when X2=2. Similarly, individuals with X1=2 belong to the high-risk group only if they also carry genotype 1 or 2 at X2; otherwise their risk is low.
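The classification procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Node` class, the `classify` function and the small tree built at the end are hypothetical. Only the root rule (X2∈{1, 2} goes left, everything else goes to terminal node E) and the definitions of nodes C and D are taken from the text; the label `X1=2` stands in for the part of Figure 1 that cannot be recovered here.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Sequence


@dataclass
class Node:
    """A node of a binary classification tree.

    Internal nodes carry a splitting rule (snp, left_genotypes);
    terminal nodes have no children and only a label.
    """
    label: str
    snp: Optional[int] = None                     # index of the splitting SNP
    left_genotypes: FrozenSet[int] = frozenset()  # genotypes sent to the left child
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def is_terminal(self) -> bool:
        return self.left is None and self.right is None


def classify(node: Node, genotypes: Sequence[int]) -> str:
    """Drop one observation down the tree and return its terminal-node label."""
    while not node.is_terminal():
        go_left = genotypes[node.snp] in node.left_genotypes
        node = node.left if go_left else node.right
    return node.label


# Hypothetical tree; genotypes are given as (X1, X2), so X1 has index 0, X2 index 1.
X1, X2 = 0, 1
tree = Node(
    "root", snp=X2, left_genotypes=frozenset({1, 2}),
    left=Node(
        "X2 in {1,2}", snp=X1, left_genotypes=frozenset({0, 1}),
        left=Node(
            "X1 in {0,1}", snp=X2, left_genotypes=frozenset({2}),
            left=Node("C"),    # X1 in {0,1} and X2 = 2
            right=Node("D"),   # X1 in {0,1} and X2 = 1
        ),
        right=Node("X1=2"),    # placeholder for the remaining branch
    ),
    right=Node("E"),           # X2 = 0
)

print(classify(tree, (0, 2)))  # an individual with X1 = 0 and X2 = 2
```

Because every terminal node is reached through a sequence of conditions on different SNPs, reading the path back from a terminal node gives the genotype combination that the node represents.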

Bayesian Classification Tree Search Method

Consider a case-control sample of $n$ subjects. For individual $i$, $y_i$ is a binary response taking the values 0 (control) and 1 (case), and $x_i = (x_{i1}, \dots, x_{ik})$ are the genotypes of $k$ SNPs. Let $T$ denote a binary tree like the one shown in Figure 1. Note that $T$ is the parameter of interest, to which we will assign a prior distribution. Recall that paths from the root node down the tree $T$ naturally represent interactions among features, so inferences about epistatic interactions can be made from the posterior distribution of $T$. Define $m(T)$ to be the number of terminal nodes of $T$; for notational simplicity we write $m$ in place of $m(T)$. For a given tree $T$ and terminal node $j$, let $t_j$ be the set of all individuals in node $j$ and $p_j$ be the mean response of $y$ in that node. The $p_j$ are model parameters, on which we will put prior distributions and which will be integrated out, as shown next. Since the $y_i$ of all individuals in node $j$ are i.i.d. Bernoulli, the likelihood function can be written as

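$$p(y \mid x, p_1, \dots, p_m, T) = \prod_{j=1}^{m} \prod_{i \in t_j} p_j^{\,y_i} (1 - p_j)^{1 - y_i}$$

where the inner product runs over the individuals $i \in t_j$ assigned to terminal node $j$.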

Now we address the problem of choosing priors for the parameters $(p_1, \dots, p_m, T)$, which can be expressed in the conditional form $\pi(p_1, \dots, p_m, T) = \pi(p_1, \dots, p_m \mid T)\,\pi(T)$. Note that we are interested in the posterior of $T$ and want to integrate out all the $p_j$. Assuming conditional independence, a natural choice for the conditional prior $\pi(p_1, \dots, p_m \mid T)$ is a conjugate Beta($\gamma_1$, $\gamma_2$) prior for each $p_j$ given $T$. Under this prior, the marginal likelihood is:

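$$p(y \mid x, T) = \prod_{j=1}^{m} \frac{\Gamma(\gamma_1 + \gamma_2)}{\Gamma(\gamma_1)\,\Gamma(\gamma_2)} \cdot \frac{\Gamma(n_{j1} + \gamma_1)\,\Gamma(n_{j0} + \gamma_2)}{\Gamma(n_{j0} + n_{j1} + \gamma_1 + \gamma_2)} \qquad (1)$$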

where $n_{j0}$ and $n_{j1}$ are the numbers of controls and cases in node $j$, respectively. Note that setting $\gamma_1 = \gamma_2 = 1$ yields the uniform prior on $p_j$.
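As an illustration of how the marginal likelihood can be evaluated in practice, the sketch below computes its logarithm from the per-node counts, assuming the standard Beta-Bernoulli result in which each terminal node contributes B(n_j1 + γ1, n_j0 + γ2) / B(γ1, γ2). The function names are ours, and working on the log scale with `math.lgamma` is an implementation choice to avoid overflow, not something stated in the paper.

```python
from math import lgamma


def log_beta(a: float, b: float) -> float:
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_marginal_likelihood(node_counts, gamma1: float = 1.0, gamma2: float = 1.0) -> float:
    """Log marginal likelihood of a tree.

    node_counts: iterable of (n_controls, n_cases), one pair per terminal node.
    gamma1, gamma2: parameters of the Beta prior on each node's case probability.
    """
    total = 0.0
    for n0, n1 in node_counts:
        total += log_beta(n1 + gamma1, n0 + gamma2) - log_beta(gamma1, gamma2)
    return total


# Counts from the Figure 1 example: the unsplit sample versus the first split,
# which sends 280 controls to terminal node E and the rest to the left child.
root_only = log_marginal_likelihood([(500, 500)])
first_split = log_marginal_likelihood([(280, 0), (220, 500)])
print(first_split > root_only)
```

The comparison at the end shows why a split that isolates a pure group of controls is strongly favoured by the data term of the posterior, before the tree prior is taken into account.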

Next we consider specifying prior distributions for the tree model T. In the literature several priors have been proposed including the prior derived from a stochastic tree-generating process (Chipman et al. 1998), a truncated Poisson prior on the number of terminal nodes (Denison et al. 1998), and the pinball prior (Wu et al. 2007). Here we follow Chipman et al. (1998) because it is intuitive to understand and has the advantage of easy implementation. This prior distribution, denoted by π(T), does not have a closed-form expression; rather, drawing from π(T) is through a stochastic tree generating process as follows. Starting from the trivial tree that has only one singleton node, a terminal node η will split with probability

$$p_{\text{split}}(\eta) = \alpha\,(1 + d_\eta)^{-\beta} \qquad (2)$$

where $d_\eta$ is the depth of $\eta$, and $\alpha$ and $\beta$ are hyperparameters. If the node splits, a splitting rule is formed by randomly drawing a SNP from the pool of all available ones and then randomly selecting, among the available genotypes of that SNP, those that go to the left child; the remaining genotypes go to the right child. Finally, we set $\eta$ to each of the newly created left and right child nodes and repeat the procedure recursively. Note that the left and right nodes are constructed independently. The splitting probability is $\alpha$ for the root node and decreases at the rate $(1 + d_\eta)^{-\beta}$ as the depth increases. As a consequence, this prior penalizes unbalanced or large trees, and tree sizes are controlled by the hyperparameters $\alpha$ and $\beta$. In addition, the splitting rule gives equal probabilities to all SNPs under consideration and is thus non-informative.
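The tree-generating process can be sketched as a short recursive function. The sketch assumes that the split probability is α(1 + d)^(-β) with the root at depth 0, and it simplifies the rule draw: any of the six ways of sending a non-empty proper subset of the genotypes {0, 1, 2} to the left child is allowed at every node, without tracking which SNPs or genotypes are still available along the path. The names `draw_tree` and `n_terminal` are ours.

```python
import random

# The six possible left-child genotype sets for a SNP with genotypes 0, 1, 2.
GENOTYPE_SPLITS = [frozenset(s) for s in ({0}, {1}, {2}, {0, 1}, {0, 2}, {1, 2})]


def p_split(depth: int, alpha: float, beta: float) -> float:
    """Prior probability that a terminal node at the given depth splits."""
    return alpha * (1.0 + depth) ** (-beta)


def draw_tree(n_snps: int, alpha: float, beta: float, rng: random.Random, depth: int = 0):
    """Draw one tree from the prior; a terminal node is represented by None."""
    if rng.random() >= p_split(depth, alpha, beta):
        return None
    return {
        "snp": rng.randrange(n_snps),                   # SNP chosen uniformly
        "left_genotypes": rng.choice(GENOTYPE_SPLITS),  # simplified rule draw
        "left": draw_tree(n_snps, alpha, beta, rng, depth + 1),
        "right": draw_tree(n_snps, alpha, beta, rng, depth + 1),
    }


def n_terminal(tree) -> int:
    """Number of terminal nodes of a tree drawn by draw_tree."""
    if tree is None:
        return 1
    return n_terminal(tree["left"]) + n_terminal(tree["right"])


rng = random.Random(1)
sizes = [n_terminal(draw_tree(100, 0.95, 1.8, rng)) for _ in range(20000)]
frac_singleton = sizes.count(1) / len(sizes)
frac_two_leaves = sizes.count(2) / len(sizes)
print(frac_singleton, frac_two_leaves)
```

Tabulating `sizes` for different values of α and β gives a Monte Carlo estimate of the prior distribution of the number of terminal nodes.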

The posterior of T is p(T|x, y) ∝p(y|x, T)π(T). Although the space of T is finite, it is infeasible to exhaustively evaluate all possible trees in the genome-wide setting. Here the Metropolis–Hastings algorithm described below can be applied to draw from the posterior distribution.

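  1. Set an initial tree $T^0$. It can be any tree; the simplest choice is the trivial singleton tree.
  2. At step $i$, propose a candidate tree $T^*$ from a transition function $q(T^i, T^*)$, and set $T^{i+1} = T^*$ with probability

     $$\min\left\{ \frac{q(T^*, T^i)}{q(T^i, T^*)} \cdot \frac{p(y \mid x, T^*)\,\pi(T^*)}{p(y \mid x, T^i)\,\pi(T^i)},\ 1 \right\} \qquad (3)$$

     Otherwise keep the current tree, i.e., set $T^{i+1} = T^i$.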

In step 2, it is important to specify the transition function $q$. Here we follow Chipman et al. (1998) and consider a transition $q(T^i, T^*)$ that randomly chooses one of four operations: Grow, Prune, Change and Swap. Details can be found in Chipman et al. (1998); here we provide a brief description. In the Grow step, a randomly chosen terminal node $\eta$ is split into two child nodes by the same procedure as in the prior drawing step. The Prune step is the exact reverse of Grow: two randomly selected sibling terminal nodes are pruned, so that their parent becomes a terminal node. In the Change step, one randomly picks an internal node and changes its splitting rule at random. The Swap step randomly selects a parent-child pair that are both internal nodes and swaps their splitting rules. Note that the Grow and Prune steps are counterparts of each other, as mentioned above, and the Change and Swap operations are their own counterparts. This feature is appealing because it results in a reversible Markov chain, which ensures convergence to the posterior distribution. Moreover, it greatly simplifies the evaluation of (3) (Chipman et al. 1998).
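The accept/reject logic of the algorithm can be sketched generically, leaving the tree-specific parts abstract. In the sketch below, `log_post` stands for the log of the unnormalized posterior (log marginal likelihood plus log prior) and `propose` for a draw from the transition function; both are placeholders supplied by the caller. The Grow, Prune, Change and Swap moves are not implemented here, and the toy five-state target at the end is purely illustrative.

```python
import math
import random


def log_accept_ratio(log_post_cur, log_post_new, log_q_forward, log_q_reverse):
    """Log of the Metropolis-Hastings acceptance probability (capped at log 1 = 0)."""
    return min(0.0, (log_post_new + log_q_reverse) - (log_post_cur + log_q_forward))


def metropolis_hastings(init, log_post, propose, n_iter, rng):
    """Generic Metropolis-Hastings loop.

    propose(state, rng) must return (candidate, log_q_forward, log_q_reverse),
    where log_q_forward = log q(state, candidate) and
    log_q_reverse = log q(candidate, state).
    """
    state, lp = init, log_post(init)
    chain = [state]
    for _ in range(n_iter):
        cand, lqf, lqr = propose(state, rng)
        lp_cand = log_post(cand)
        # 1 - random() lies in (0, 1], so the logarithm is always defined.
        if math.log(1.0 - rng.random()) <= log_accept_ratio(lp, lp_cand, lqf, lqr):
            state, lp = cand, lp_cand
        chain.append(state)
    return chain


# Toy check on a five-state target with unnormalized weights 1, 2, 4, 2, 1.
WEIGHTS = [1.0, 2.0, 4.0, 2.0, 1.0]


def toy_log_post(s):
    return math.log(WEIGHTS[s])


def toy_propose(s, rng):
    log_q = -math.log(len(WEIGHTS))  # uniform proposal, so forward = reverse
    return rng.randrange(len(WEIGHTS)), log_q, log_q


chain = metropolis_hastings(0, toy_log_post, toy_propose, 20000, random.Random(0))
freq_mode = chain.count(2) / len(chain)  # target probability of state 2 is 0.4
print(freq_mode)
```

For trees, repeated random restarts amount to calling such a loop several times from different initial trees and keeping the visited tree with the largest posterior value.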

Results

Simulated Data

To evaluate the performance of the Bayesian tree model, we first conduct simulation studies. We use 12 two-locus interaction models considered by several authors (e.g. Neuman & Rice 1992; Schork et al. 1993; Knapp et al. 1994; Becker et al. 2005) plus 2 additive models used by Chen et al. (2007) for epistasis detection. We simulate 500 cases and 500 controls according to the disease models listed in Tables 1 and 2. We assume there are two disease loci that are in linkage equilibrium, and 98 non-disease loci. Each locus is diallelic, in Hardy-Weinberg equilibrium (HWE), and unlinked to any other. The two disease loci are denoted by SNP1, with alleles A and a, and SNP2, with alleles B and b. The two-locus penetrances and relative risks (RR) are shown in Table 1. The minor allele frequencies (MAF) and marginal relative risks of the two disease loci are listed in Table 2. Note that the disease prevalence and the percentage of phenotypic variance explained by the two disease loci, shown in Table 2, are fully determined by the penetrance and MAF parameters specified in the two tables, given the model assumptions. MAFs of the 98 non-disease loci are drawn at random from a uniform distribution on [0.05, 0.50]. Detailed information can be found in Knapp et al. (1994) and Chen et al. (2007).
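The relationship between the penetrance table, the allele frequencies, the prevalence and the marginal relative risks can be checked with a small sketch. It assumes that the MAF columns of Table 2 give the frequencies of alleles a and b, that genotypes are coded by the number of copies of those alleles, and that marginal RRs are taken relative to the population prevalence. The sampler draws only the two disease loci by rejection sampling; the 98 non-disease loci are omitted. All function names are ours.

```python
import random


def hwe_freqs(q):
    """Genotype frequencies under HWE, indexed by the number of copies (0, 1, 2)
    of the allele with frequency q."""
    p = 1.0 - q
    return [p * p, 2.0 * p * q, q * q]


def prevalence(pen, q1, q2):
    """Population prevalence implied by a 3x3 penetrance table."""
    f1, f2 = hwe_freqs(q1), hwe_freqs(q2)
    return sum(f1[i] * f2[j] * pen[i][j] for i in range(3) for j in range(3))


def marginal_rr_snp1(pen, q1, q2):
    """Marginal risk of each SNP1 genotype divided by the prevalence."""
    f2 = hwe_freqs(q2)
    k = prevalence(pen, q1, q2)
    return [sum(f2[j] * pen[i][j] for j in range(3)) / k for i in range(3)]


def simulate_case_control(pen, q1, q2, n_cases, n_controls, rng):
    """Rejection-sample (g1, g2, status) triples until both groups are full."""
    f1, f2 = hwe_freqs(q1), hwe_freqs(q2)
    cases, controls = [], []
    while len(cases) < n_cases or len(controls) < n_controls:
        g1 = rng.choices([0, 1, 2], weights=f1)[0]
        g2 = rng.choices([0, 1, 2], weights=f2)[0]
        if rng.random() < pen[g1][g2]:
            if len(cases) < n_cases:
                cases.append((g1, g2, 1))
        elif len(controls) < n_controls:
            controls.append((g1, g2, 0))
    return cases + controls


# Ep-1 penetrances from Table 1: rows AA, Aa, aa; columns BB, Bb, bb.
EP1 = [[0.0, 0.00, 0.00],
       [0.0, 0.71, 0.71],
       [0.0, 0.71, 0.71]]

ep1_prevalence = prevalence(EP1, 0.21, 0.21)
ep1_rr = marginal_rr_snp1(EP1, 0.21, 0.21)
sample = simulate_case_control(EP1, 0.21, 0.21, 50, 50, random.Random(0))
print(round(ep1_prevalence, 3), [round(r, 1) for r in ep1_rr])
```

Under these assumptions the Ep-1 row of Table 2 (prevalence 0.100, marginal RRs of 0.0, 2.7 and 2.7) is reproduced to rounding.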

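Table 1.  Prevalence and odds ratio of two-locus epistatic models.

| Model | Type* | SNP1 genotype | Penetrance BB | Penetrance Bb | Penetrance bb | RR BB | RR Bb | RR bb |
|---|---|---|---|---|---|---|---|---|
| Ep-1 | 1,s | AA | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 |
| | | Aa | 0 | 0.71 | 0.71 | 0.0 | 7.1 | 7.1 |
| | | aa | 0 | 0.71 | 0.71 | 0.0 | 7.1 | 7.1 |
| Ep-2 | 1,u | AA | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 |
| | | Aa | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 |
| | | aa | 0 | 0.78 | 0.78 | 0.0 | 7.8 | 7.8 |
| Ep-3 | 1,s | AA | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 |
| | | Aa | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 |
| | | aa | 0 | 0 | 0.9 | 0.0 | 0.0 | 9.0 |
| Ep-4 | 1,u | AA | 0 | 0 | 0.91 | 0.0 | 0.0 | 9.1 |
| | | Aa | 0 | 0 | 0.91 | 0.0 | 0.0 | 9.1 |
| | | aa | 0 | 0.91 | 0.91 | 0.0 | 9.1 | 9.1 |
| Ep-5 | 1,s | AA | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 |
| | | Aa | 0 | 0 | 0.80 | 0.0 | 0.0 | 8.0 |
| | | aa | 0 | 0.80 | 0.80 | 0.0 | 8.0 | 8.0 |
| Ep-6 | 1,s | AA | 0 | 0 | 1 | 0.0 | 0.0 | 14.3 |
| | | Aa | 0 | 0 | 1 | 0.0 | 0.0 | 14.3 |
| | | aa | 1 | 1 | 0 | 14.3 | 14.3 | 0.0 |
| Het-1 | 2,s | AA | 0 | 0.50 | 0.50 | 0.0 | 5.0 | 5.0 |
| | | Aa | 0.5 | 0.75 | 0.75 | 5.0 | 7.5 | 7.5 |
| | | aa | 0.5 | 0.75 | 0.75 | 5.0 | 7.5 | 7.5 |
| Het-2 | 2,u | AA | 0 | 0.66 | 0.66 | 0.0 | 6.6 | 6.6 |
| | | Aa | 0 | 0.66 | 0.66 | 0.0 | 6.6 | 6.6 |
| | | aa | 0.66 | 0.88 | 0.88 | 6.6 | 8.8 | 8.8 |
| Het-3 | 2,s | AA | 0 | 0 | 1 | 0.0 | 0.0 | 13.5 |
| | | Aa | 0 | 0 | 1 | 0.0 | 0.0 | 13.5 |
| | | aa | 1 | 1 | 1 | 13.5 | 13.5 | 13.5 |
| S-1 | 2,s | AA | 0 | 0.52 | 0.52 | 0.0 | 5.2 | 5.2 |
| | | Aa | 0.52 | 0.52 | 0.52 | 5.2 | 5.2 | 5.2 |
| | | aa | 0.52 | 0.52 | 0.52 | 5.2 | 5.2 | 5.2 |
| S-2 | 2,u | AA | 0 | 0.57 | 0.57 | 0.0 | 5.7 | 5.7 |
| | | Aa | 0 | 0.57 | 0.57 | 0.0 | 5.7 | 5.7 |
| | | aa | 1 | 1 | 1 | 10.0 | 10.0 | 10.0 |
| S-3 | 1,s | AA | 0 | 0 | 0.51 | 0.0 | 0.0 | 5.1 |
| | | Aa | 0 | 0.51 | 1 | 0.0 | 5.1 | 10.0 |
| | | aa | 0.51 | 1 | 1 | 5.1 | 10.0 | 10.0 |
| Ad-1 | 3,u | AA | 0.01 | 0.01 | 0.01 | 0.1 | 0.1 | 0.1 |
| | | Aa | 0.02 | 0.30 | 0.80 | 0.1 | 1.7 | 4.6 |
| | | aa | 0.04 | 0.80 | 0.80 | 0.2 | 4.6 | 4.6 |
| Ad-2 | 3,u | AA | 0.05 | 0.05 | 0.05 | 0.2 | 0.2 | 0.2 |
| | | Aa | 0.10 | 0.32 | 0.80 | 0.5 | 1.5 | 3.7 |
| | | aa | 0.15 | 0.80 | 0.80 | 0.7 | 3.7 | 3.7 |

  * Type: 1–Epistasis model; 2–Heterogeneity model; 3–Additive model; s–symmetrical; u–unsymmetrical.

  RR: The baseline risk is the population disease prevalence listed in Table 2.
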
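Table 2.  MAF, prevalence, and odds ratio of two-locus epistatic models.

| Model | Type* | MAF SNP1 | MAF SNP2 | Disease prevalence | RR of SNP1: AA | Aa | aa | RR of SNP2: BB | Bb | bb | Phenotypic variance explained by 2 SNPs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Ep-1 | 1,s | 0.210 | 0.210 | 0.100 | 0.0 | 2.7 | 2.7 | 0.0 | 2.7 | 2.7 | 67.4% |
| Ep-2 | 1,u | 0.600 | 0.199 | 0.100 | 0.0 | 0.0 | 2.8 | 0.0 | 2.8 | 2.8 | 75.3% |
| Ep-3 | 1,s | 0.577 | 0.577 | 0.100 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | 3.0 | 88.9% |
| Ep-4 | 1,u | 0.372 | 0.243 | 0.100 | 0.5 | 0.5 | 3.9 | 0.0 | 1.3 | 9.1 | 90.1% |
| Ep-5 | 1,s | 0.349 | 0.349 | 0.100 | 0.0 | 1.0 | 4.6 | 0.0 | 1.0 | 4.6 | 77.7% |
| Ep-6 | 1,s | 0.190 | 0.190 | 0.070 | 0.5 | 0.5 | 13.9 | 0.5 | 0.5 | 13.9 | 100% |
| Het-1 | 2,s | 0.053 | 0.053 | 0.100 | 0.5 | 5.2 | 5.2 | 0.5 | 5.2 | 5.2 | 46.1% |
| Het-2 | 2,u | 0.279 | 0.040 | 0.100 | 0.5 | 0.5 | 6.7 | 0.5 | 6.7 | 6.7 | 63.5% |
| Het-3 | 2,s | 0.194 | 0.194 | 0.074 | 0.5 | 0.5 | 13.5 | 0.5 | 0.5 | 13.5 | 100% |
| S-1 | 2,s | 0.052 | 0.052 | 0.100 | 0.5 | 5.2 | 5.2 | 0.5 | 5.2 | 5.2 | 46.9% |
| S-2 | 2,u | 0.228 | 0.045 | 0.100 | 0.5 | 0.5 | 10.0 | 0.5 | 6.0 | 6.0 | 77.3% |
| S-3 | 1,s | 0.194 | 0.194 | 0.100 | 0.2 | 2.0 | 6.8 | 0.2 | 2.0 | 6.8 | 59.3% |
| Ad-1 | 3,u | 0.349 | 0.349 | 0.173 | 0.1 | 1.4 | 2.8 | 0.1 | 1.4 | 2.7 | 48.5% |
| Ad-2 | 3,u | 0.349 | 0.349 | 0.215 | 0.2 | 1.3 | 2.4 | 0.4 | 1.2 | 2.2 | 35.2% |

  * Type: 1–Epistasis model; 2–Heterogeneity model; 3–Additive model; s–symmetrical; u–unsymmetrical.

  RR: The baseline risk is the population disease prevalence listed in the Disease prevalence column.
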

For the Bayesian classification model, we test four different sets of values for the hyperparameters (α, β), namely (0.8,1), (0.8,1.8), (0.95,1) and (0.95,1.8). We draw trees from their prior distributions and plot the prior distribution of the number of terminal nodes in Figure 2. To understand the effect of hyperparameters, we use α= 0.95 and β= 1.8 as an example. This prior specifies that any SNP can be split at the root node (the first level node) with a prior probability 0.95; but this probability decreases rapidly to 0.27 and 0.13 at the second and third level, respectively. In general, large values of α will reduce the probability of getting a singleton tree (i.e., trees with only one node), whereas large values of β will prevent a tree from growing, reducing the probability on large trees. Indeed this is what we observe from Figure 2. It is clear that the priors with α= 0.95 generate fewer singleton trees than α= 0.8. The priors with β= 1 have more prior weights on larger trees than β= 1.8. Note that the posterior probability of each split depends on both the prior probability and conditional effect. Thus, at the root node a SNP would have a certain posterior probability for splitting, even if its marginal effect is weak. However, at the second level, a SNP would be split only if it has a reasonably large conditional effect given the first SNP. In other words the epistatic effect of these two SNPs must be large.
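As a worked check of the figures quoted above, assume that the split probability in (2) has the form $\alpha(1+d_\eta)^{-\beta}$ with the root at depth $d_\eta = 0$, which is what the quoted values imply. For $\alpha = 0.95$ and $\beta = 1.8$:

$$0.95 \cdot 1^{-1.8} = 0.95, \qquad 0.95 \cdot 2^{-1.8} \approx 0.27, \qquad 0.95 \cdot 3^{-1.8} \approx 0.13 .$$

Under the same assumption, and ignoring any restriction on which splitting rules are available, the prior probability of a singleton tree is $1 - \alpha$, that is, 0.20 for $\alpha = 0.8$ and 0.05 for $\alpha = 0.95$. A tree with exactly two terminal nodes arises when the root splits and neither child does:

$$P(m = 2) = \alpha \left(1 - \alpha\, 2^{-\beta}\right)^{2} \approx 0.29,\ 0.47,\ 0.26,\ 0.50$$

for $(\alpha, \beta) = (0.8, 1)$, $(0.8, 1.8)$, $(0.95, 1)$ and $(0.95, 1.8)$, respectively. With $\beta = 1.8$ roughly half of the prior mass therefore sits on trees with a single split, whereas $\beta = 1$ shifts weight toward larger trees.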

Figure 2.

Prior distribution of number of terminal nodes with various hyperparameters.

For comparison, we consider an exhaustive search of all two-way interactions using PLINK (Purcell et al. 2007) with the fast epistasis option ‘fast-epistasis,’ which is known to yield results very similar to logistic regression on all pairs of SNPs but is more computationally efficient. For each disease model we simulate 50 case-control data sets. Table 3 compares the methods in terms of power and false positive rate (FPR) based on these 50 simulation runs. For the PLINK method, the power is defined as the proportion of runs in which the interaction between SNP1 and SNP2 is significant at the 0.10 level after Bonferroni correction for 4,950 comparisons, i.e., the p value is less than 0.10/4950. The FPR is the proportion of runs in which a false two-way interaction is detected. For the Bayesian classification tree, we run MCMC with three random restarts of 4000 iterations each, and the tree with the largest posterior probability is reported. The power is defined as the proportion of runs in which at least one terminal node involves splitting on both SNP1 and SNP2. The FPR is defined as the proportion of runs in which at least one terminal node involves splitting on two or more SNPs other than SNP1 and SNP2. The table shows that the Bayesian tree is powerful in detecting the epistasis. The performance under different hyperparameters is quite similar, suggesting that the method is not sensitive to the choice of hyperparameters in this case. On the other hand, the PLINK fast epistasis search fails to identify the epistasis in half of the 14 models, and has lower power than the Bayesian classification tree in the other half.
We also note that we conducted a two-stage search algorithm using logistic regression (results not reported here), in which single-locus analysis was done in the first stage to identify the top 10 most significant SNPs, and then in stage 2 we performed exhaustive search of all possible two-locus models (with two-way interactions included) involving these 10 SNPs. The power of detecting epistasis of this two-stage search approach was low. The reason is that in many cases the marginal effects are elusive and are missed in the first stage, which leads to a poor level of power in identifying the epistatic effect.
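The power and FPR criteria for the Bayesian tree, and the Bonferroni threshold used for PLINK, can be written down compactly. In this sketch each reported tree is represented by its terminal nodes, and each terminal node by the list of SNP indices split on along the path from the root; this representation, the function names, and the convention that SNP1 and SNP2 have indices 0 and 1 are assumptions made for illustration.

```python
from math import comb

# Bonferroni threshold for the PLINK comparison: 100 SNPs give 4,950 pairs.
N_SNPS = 100
N_PAIRS = comb(N_SNPS, 2)
THRESHOLD = 0.10 / N_PAIRS

DISEASE_SNPS = frozenset({0, 1})  # assumed indices of SNP1 and SNP2


def detects_true_interaction(leaf_paths, disease_snps=DISEASE_SNPS):
    """True if some terminal node's path splits on both disease SNPs."""
    return any(disease_snps <= set(path) for path in leaf_paths)


def has_false_interaction(leaf_paths, disease_snps=DISEASE_SNPS):
    """True if some terminal node's path splits on two or more non-disease SNPs."""
    return any(len(set(path) - disease_snps) >= 2 for path in leaf_paths)


def power_and_fpr(best_trees):
    """best_trees: one list of leaf paths per simulated data set."""
    n = len(best_trees)
    power = sum(detects_true_interaction(t) for t in best_trees) / n
    fpr = sum(has_false_interaction(t) for t in best_trees) / n
    return power, fpr


# Three made-up reported trees, for illustration only.
example = [
    [[0, 1], [0, 1], [0]],           # splits on SNP1 then SNP2: a true detection
    [[5], [5]],                      # a single split on a null SNP: neither
    [[0, 7, 9], [0, 7, 9], [0, 7], [0]],  # two null SNPs on one path: a false positive
]
power, fpr = power_and_fpr(example)
print(N_PAIRS, THRESHOLD, power, fpr)
```

Note that under these definitions a single tree can count toward both power and FPR, since the two criteria are evaluated independently on its terminal nodes.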

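Table 3.  Power and FPR comparison of detecting epistasis.

Columns 2 to 9 give results for the Bayesian classification tree under four hyperparameter settings; the last two columns give results for PLINK fast epistasis.

| Model | α=0.80, β=1.0 Power | FPR | α=0.80, β=1.8 Power | FPR | α=0.95, β=1.0 Power | FPR | α=0.95, β=1.8 Power | FPR | PLINK Power | FPR |
|---|---|---|---|---|---|---|---|---|---|---|
| Ep-1 | 0.96 | 0.02 | 0.92 | 0.08 | 0.92 | 0.08 | 0.98 | 0.14 | 0.32 | 0.06 |
| Ep-2 | 0.98 | 0.10 | 0.94 | 0.12 | 0.98 | 0.12 | 0.94 | 0.06 | 0 | 0.12 |
| Ep-3 | 1 | 0.10 | 0.98 | 0.14 | 0.98 | 0.14 | 0.98 | 0.14 | 0 | 0.06 |
| Ep-4 | 0.98 | 0.04 | 1 | 0.04 | 1 | 0.06 | 1 | 0.08 | 0 | 0.14 |
| Ep-5 | 1 | 0.04 | 1 | 0.04 | 1 | 0.06 | 1 | 0.04 | 0 | 0.04 |
| Ep-6 | 0.98 | 0.04 | 1 | 0.06 | 0.98 | 0.02 | 1 | 0 | 0 | 0.04 |
| Het-1 | 0.9 | 0.06 | 0.94 | 0.10 | 0.94 | 0.10 | 0.98 | 0.08 | 0.12 | 0.22 |
| Het-2 | 1 | 0.12 | 1 | 0.08 | 0.98 | 0.20 | 1 | 0.06 | 0.36 | 0.12 |
| Het-3 | 0.98 | 0.02 | 1 | 0.04 | 0.96 | 0.08 | 0.98 | 0.06 | 0 | 0 |
| S-1 | 0.98 | 0.08 | 0.96 | 0.10 | 0.92 | 0.12 | 0.86 | 0.06 | 0.58 | 0.06 |
| S-2 | 1 | 0.02 | 0.98 | 0.10 | 0.96 | 0.06 | 0.96 | 0.10 | 0.62 | 0.06 |
| S-3 | 1 | 0.10 | 1 | 0.16 | 1 | 0.14 | 1 | 0.16 | 0 | 0.08 |
| Ad-1 | 0.98 | 0.26 | 0.98 | 0.10 | 1 | 0.22 | 0.98 | 0.08 | 0.10 | 0.12 |
| Ad-2 | 0.98 | 0.14 | 0.96 | 0.12 | 0.98 | 0.08 | 0.98 | 0.16 | 0.40 | 0.10 |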

Crohn's Disease Data

Next we use a case-control data set of Crohn's disease (Duerr et al. 2006) to demonstrate the use of the Bayesian classification model. Crohn's disease is an ongoing autoimmune disorder that causes discontinuous and transmural inflammation in the digestive tract. It most commonly affects the lower part of the small intestine called the ileum. Crohn's disease has been found to have a strong genetic component (Peeters et al. 1996). For example, relatives have a 20–30 fold increased risk compared to non-relatives, and monozygotic twins have a 10–50 fold increased risk compared to dizygotic twins. The disease is believed to involve the interaction of several factors such as genetic susceptibility, the intestinal microbial flora inside the patient, the immune response to these microbiota, and triggers involving environmental factors (Sartor 2006).

Here we apply the Bayesian tree to a cohort containing 401 cases and 433 controls. For quality control, we exclude SNPs with a call rate lower than 0.99, a minor allele frequency lower than 0.05, or an HWE p value lower than 0.001. In addition, all subjects with a call rate below 0.95 are removed from the analysis. In total, 397 cases and 431 controls pass the quality thresholds and are kept in the analysis. We first conduct single-SNP association tests, select the top 5000 SNPs based on their p values, and apply the Bayesian classification tree to those 5000 SNPs. The main reason for choosing the top 5000 SNPs is to balance statistical power and computational efficiency. A premise of this selection is that most interactions involve genes with weak to moderate marginal effects, so focusing on the top SNPs would likely capture most interactions unless the interaction patterns are such that there is no main effect at all. In the simple case of two-way interactions, our previous analytical work (Wu & Zhao 2009) and a follow-up study (Wu & Zhao 2010, unpublished data) suggest that the two-stage analysis is among the most efficient approaches, at least for the models considered. The top 5000 SNPs contain many SNPs with weak marginal effects; their p values are roughly uniformly distributed from 0 to 0.02. Table 4 lists the 20 SNPs with the smallest association-test p values among these 5000, together with the other SNPs selected by the best trees. We run the MCMC with five restarts of 50,000 iterations each. The hyperparameters are set to α= 0.95 and β= 0.5. The best tree with two terminal nodes splits on rs1343151, which lies in IL23R on chromosome 1. As can be seen from Table 4, IL23R is among the top genes in the list and has previously been confirmed to be associated with Crohn's disease (Barrett et al. 2008). The best tree with three terminal nodes involves rs2463031 on chromosome 5, ranked number 8 in Table 4, and rs3213255 on chromosome 19.
SNP rs2463031 is in the intergenic region between LOC345571 and EFNA5, while rs3213255 is in an intron of XRCC1. An interesting case is the best tree with 4 terminal nodes, which is plotted in Figure 3. A contingency table of the first two SNPs is shown in Table 5. To assess the classification error of this tree, we test for epistasis by an exhaustive search of all 2-locus models and keep those whose p values are below 0.001. For the retained models we calculate the misclassification rates; their histogram is shown in Figure 4. We also mark the misclassification rate of the best Bayesian tree with 4 terminal nodes on the same plot. It is clear that the Bayesian method gives an error rate close to the lower bound of all models found by exhaustive search. The tree shown in Figure 3 identifies a possible epistasis between rs136211 ∈{1, 2} and rs178900 ∈{1}: individuals carrying this combination of genotypes have a significantly lower risk (0.27) than others. We notice that this interaction is missed by the exhaustive logistic regression search, because the p value of the two-way interaction in the logistic regression model is 0.15. We now look at the functional annotations of these SNPs. In the classification tree, the SNP at the root node is rs136211, which is located on chromosome 22 in the gene region of MYH9, which encodes the non-muscle myosin IIA (NM IIA) heavy chain. Recently NM IIA was found to regulate intestinal epithelial cell restitution and matrix invasion (Babbin et al. 2009). Intestinal epithelial restitution is the closure of mucosal wounds, which is heavily influenced by epithelial migration. Epithelial cell migration is known to contribute significantly to the pathophysiology of intestinal disorders such as inflammatory bowel disease. The findings of Babbin et al. (2009) suggest that NM IIA promotes 2-D epithelial cell migration but antagonizes 3-D invasion.
The second SNP found in the tree is rs178900 that is located in the intron region of RAB11FIP4 on chromosome 17. RAB11FIP4 is RAB11 family interacting protein 4 that plays regulatory roles in the formation, targeting, and fusion of intracellular transport vesicles (Entrez Gene). The last SNP is rs8055192 that is located on chromosome 16 but it is not in a gene region. Its functional annotation remains unclear to us at this time. The tree model suggests that there may be epistasis among these three loci.

Table 4.  Top SNPs by single-locus analysis.

| Rank | SNP | Tree* | CHR | Unadj. P | Gene | Description |
|---|---|---|---|---|---|---|
| 1 | rs7517847 | | 1 | 7.74E-07 | IL23R | interleukin 23 receptor |
| 2 | rs1343151 | 1 | 1 | 1.31E-06 | IL23R | interleukin 23 receptor |
| 3 | rs7302601 | | 12 | 1.73E-06 | | |
| 4 | rs2076756 | | 16 | 3.06E-06 | NOD2 | nucleotide-binding oligomerization domain containing 2 |
| 5 | rs10489629 | | 1 | 3.32E-06 | IL23R | interleukin 23 receptor |
| 6 | rs9315762 | | 13 | 5.89E-06 | | |
| 7 | rs933534 | | 17 | 9.88E-06 | MSI2 | musashi homolog 2 (Drosophila) |
| 8 | rs2463031 | 2 | 5 | 1.49E-05 | | |
| 9 | rs6538370 | | 12 | 1.53E-05 | | |
| 10 | rs925530 | | 12 | 1.61E-05 | LOC144404 | hypothetical LOC144404 |
| 11 | rs7135617 | | 12 | 1.73E-05 | TMEM142A | transmembrane protein 142A |
| 12 | rs10889677 | | 1 | 1.85E-05 | IL23R | interleukin 23 receptor |
| 13 | rs12320939 | | 12 | 2.21E-05 | | |
| 14 | rs3934658 | | 12 | 2.69E-05 | | |
| 15 | rs4820972 | | 22 | 2.71E-05 | EIF4ENIF1 | eukaryotic translation initiation factor 4E nuclear import factor 1 |
| 16 | rs2201841 | | 1 | 2.94E-05 | IL23R | interleukin 23 receptor |
| 17 | rs1028863 | | 9 | 3.02E-05 | | |
| 18 | rs7398558 | | 12 | 3.13E-05 | | |
| 19 | rs4760516 | | 12 | 3.14E-05 | | |
| 20 | rs2066843 | | 16 | 3.49E-05 | NOD2 | nucleotide-binding oligomerization domain containing 2 |
| 90 | rs178900 | 3 | 17 | 0.000320 | RAB11FIP4 | RAB11 family interacting protein 4 (class II) |
| 153 | rs3213255 | 2 | 19 | 0.000537 | XRCC1 | X-ray repair complementing defective repair in Chinese hamster cells 1 |
| 1593 | rs136211 | 3 | 22 | 0.005566 | MYH9 | myosin, heavy chain 9, non-muscle |
| 2370 | rs8055192 | 3 | 16 | 0.008312 | | |

  * Tree: 1: Best 2-terminal-node tree; 2: Best 3-terminal-node tree; 3: Best 4-terminal-node tree.

Figure 3.

The best 4-terminal-node tree for Crohn's disease data. Note: -9 denotes the missing genotype.

Table 5.  Contingency table of rs136211 and rs178900.

| rs136211 | rs178900 = 0 | rs178900 = 1 | rs178900 = 2 | rs178900 = -9 (Missing) | Total |
|---|---|---|---|---|---|
| 0 | 0.58; 71/98 (169) | 0.54; 52/60 (112) | 0.54; 6/7 (13) | 1.00; 0/1 (1) | 0.56; 129/166 (295) |
| 1 | 0.51; 121/124 (245) | 0.27; 89/33 (122) | 0.38; 15/9 (24) | 0.50; 1/1 (2) | 0.42; 226/167 (393) |
| 2 | 0.55; 40/48 (88) | 0.27; 35/13 (48) | 0.75; 1/3 (4) | 0/0 (0) | 0.46; 76/64 (140) |
| Total | 0.54; 232/270 (502) | 0.38; 176/106 (282) | 0.46; 22/19 (41) | 0.67; 1/2 (3) | 0.48; 431/397 (828) |

  In each cell, the first value is the proportion of cases; the numbers after it are the numbers of controls/cases, with the total number of observations in parentheses.
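The risk of 0.27 quoted for carriers of rs136211 ∈{1, 2} together with rs178900 ∈{1} can be recomputed from the counts in Table 5. The short sketch below pools the two relevant cells and, for contrast, all remaining cells with non-missing rs178900 genotypes; leaving out the missing-genotype column is our simplification, and the variable names are ours.

```python
# (controls, cases) per cell of Table 5: the first index is the rs136211
# genotype, the second the rs178900 genotype (missing genotypes omitted).
COUNTS = {
    (0, 0): (71, 98),   (0, 1): (52, 60), (0, 2): (6, 7),
    (1, 0): (121, 124), (1, 1): (89, 33), (1, 2): (15, 9),
    (2, 0): (40, 48),   (2, 1): (35, 13), (2, 2): (1, 3),
}


def case_proportion(cells):
    """Proportion of cases among the individuals pooled over the given cells."""
    controls = sum(COUNTS[c][0] for c in cells)
    cases = sum(COUNTS[c][1] for c in cells)
    return cases / (controls + cases)


low_risk = [(1, 1), (2, 1)]  # rs136211 in {1, 2} and rs178900 = 1
others = [c for c in COUNTS if c not in low_risk]

risk_low = case_proportion(low_risk)   # 46 cases out of 170 individuals
risk_other = case_proportion(others)   # 349 cases out of 655 individuals
print(round(risk_low, 2), round(risk_other, 2))
```

The pooled group has 46 cases among 170 individuals, which matches the 0.27 in the text, against roughly 0.53 in the remaining cells.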
Figure 4.

Misclassification rate.

Discussion

In this paper we have described a Bayesian classification tree model to identify the epistatic SNPs in GWAS. In Bayesian treatment for the classification tree model, there are two key components that determine the posterior model search, that is, the prior specification of all trees and the transition kernel in MCMC. With the prior in (2) derived from a tree generating process, the model allows a SNP to split with a certain posterior probability, even when the marginal effect is not significant. This feature can enhance the power of identifying interactions among genes, as demonstrated in the simulation studies. In addition, due to the adaptive property of the MCMC algorithm, the Bayesian model search also can detect higher-order interactions.

In the real data example, we find that the MCMC algorithm moves rapidly from its initial state toward regions with high posterior probability, and tends to make local moves thereafter. This is not surprising, because the transition function proposes trees in a local neighborhood and can hardly move to another mode in the tree space. This finding is consistent with that of Chipman et al. (1998), who proposed running the MCMC with repeated random starts. In our experience this does help to find models that fit the data better. The slow convergence of the MCMC may be problematic in some cases, for instance, if one wants to do model averaging or use the posterior distribution to assess the importance of all epistatic interactions. However, it is not a major concern if the purpose is to find some potentially important interactions rather than the single most important one.

We also tested a genome-wide search using approximately 260,400 SNPs that pass certain quality control thresholds. The program is written in C++ and runs fast: it took about 70 minutes to run 1,000,000 MCMC iterations on a PC with a 2.5 GHz Intel Core 2 Duo CPU and 4 GB of memory. However, due to the huge number of trees in the search space, the mixing of the Markov chain is slow. Nonetheless, it is still feasible to apply our method to GWAS data on a cluster of servers and let it run for a large number of iterations to be more inclusive. This may not be the most statistically efficient approach under most interaction models, however, because of the substantially increased model space.

Acknowledgements

This work was supported in part by NIH grants GM 59507, U01 DK062422, 1R01DK072373, and UL1 RR024139, and NSF grant DMS-0714817.
