To identify genetic variants responsible for complex human diseases, genome-wide association studies (GWAS) often scan single nucleotide polymorphisms (SNPs) across the genomes of individuals in case and control groups. Most GWAS rely on single-locus analysis, testing the main effect of one SNP at a time. However, single-locus analysis may fail to detect SNPs that have no significant marginal effect individually but have strong effects collectively (Culverhouse et al. 2002). Genes are known to interact with one another in biological processes such as gene regulation, metabolism, signal transduction, and various development-related pathways, so genetic variants at multiple genomic loci may jointly contribute to complex phenotypes (Moore 2003). Many complex diseases, including sporadic Alzheimer's disease (Combarros et al. 2009), type 2 diabetes (Wiltshire et al. 2006), and breast cancer (Ritchie et al. 2001), have been reported to be associated with interactions among multiple polymorphisms. The phenomenon in which the effect of a variant in one gene depends on variants at other genomic loci is known as epistasis. Although epistatic effects are potentially important for uncovering disease etiology, they are difficult to identify in genome-wide settings. Many statistical methods have been proposed to address this challenge, and a recent review surveys these methods and the related software packages for detecting epistasis (Cordell 2009). According to that review, the methods include exhaustive search algorithms, data-mining and machine-learning approaches, and Bayesian model selection methods.
Among the machine-learning methods is recursive partitioning, which produces tree-structured models (Breiman et al. 1984; Zhang & Bonney 2000; Nelson et al. 2001; Cook et al. 2004). Figure 1 depicts an example of a tree model. In a tree model, each nonterminal node defines a splitting rule based on a predictor variable. A path from the top node to a terminal node maps a region of the predictor space to a specific outcome, determined by the values of the predictor variables along that path. Each terminal node therefore represents a particular combination of values of the variables on its path, and so naturally accommodates epistatic effects among those variables. In addition, because a tree can have multiple levels of nodes involving two or more variables, tree-based models also allow detection of multi-way interactions. Since the partition of the predictor space is constructed recursively, the split on a variable is conditional on the values of the variables at its ancestral nodes in the tree.
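The correspondence between a path and a combination of predictor values can be made concrete with a minimal sketch. Everything named here is hypothetical and chosen only for illustration: the `Node` class, the `classify` function, the SNP names `snp_a` and `snp_b`, the coding of genotypes as minor-allele counts 0, 1, 2, and the "go left if genotype <= threshold" rule. It is not the tree shown in Figure 1.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    """A node of a binary classification tree (illustrative only)."""
    snp: Optional[str] = None      # splitting variable; None on terminal nodes
    threshold: int = 0             # assumed rule: go left if genotype <= threshold
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    label: Optional[str] = None    # predicted class; set only on terminal nodes


def classify(node: Node, genotypes: Dict[str, int]) -> Tuple[str, List[str]]:
    """Follow the path from the top node to a terminal node.

    Returns the terminal node's label and the list of splitting rules
    that were satisfied along the way.
    """
    path: List[str] = []
    while node.label is None:
        go_left = genotypes[node.snp] <= node.threshold
        path.append(f"{node.snp} {'<=' if go_left else '>'} {node.threshold}")
        node = node.left if go_left else node.right
    return node.label, path


# Toy tree: an individual is classified as a case only when carrying at
# least one minor allele at BOTH loci, i.e. a two-way interaction.
tree = Node(
    snp="snp_a", threshold=0,
    left=Node(label="control"),
    right=Node(
        snp="snp_b", threshold=0,
        left=Node(label="control"),
        right=Node(label="case"),
    ),
)

label, path = classify(tree, {"snp_a": 1, "snp_b": 2})
print(label, path)
```

In this toy tree the split on `snp_b` is reached only when `snp_a > 0`, so the terminal node labelled "case" encodes the joint condition on both variables rather than either one alone.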
The tree space is commonly searched with a greedy algorithm: at each node, the splitting variable and its partition rule are chosen, from the pool of all available variables and splitting values, to maximize the separation of the resulting partition. Because each split is judged only by its immediate effect, algorithms of this type may fail to identify interactions that do not display substantial marginal effects (Cordell 2009). To alleviate this problem, algorithms based on Bayesian modeling were proposed that stochastically search for promising classification trees using Markov chain Monte Carlo (MCMC) methods (Chipman et al. 1998; Denison et al. 1998). Bayesian methods for detecting epistatic associations have also been proposed by several authors (Lunn et al. 2006; Zhang & Liu 2007). Bayesian classification trees are closely related to Bayesian model selection: a prior is assigned over all tree models, and it serves to control the sizes of the trees. One advantage of such a prior specification is that a variable is split on with a certain probability even if it does not exhibit a strong marginal effect. As a result, the method may increase the probability of finding epistatic effects whose marginal effects are weak. The MCMC algorithm is also adaptive, in that it tends to search more thoroughly in the vicinity of trees containing interacting variables found in previous iterations, which allows the detection of multi-way interactions. This feature distinguishes the approach from methods based on tree ensembles, such as random forests (Breiman 2001; Lunetta et al. 2004; Bureau et al. 2005), in which trees are constructed independently and so are ‘memoryless’ with respect to promising trees visited previously. As a result, potentially important interactions may be diluted in an ensemble consisting of a large number of trees, making it difficult to uncover possible multi-way interactions.
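The limitation of greedy splitting described above can be seen in a small, deliberately extreme example. The sketch below rests on several assumptions that are not taken from the text: predictors are binary rather than three-level genotypes, the outcome is a noise-free exclusive-or of the first two predictors, and separation is measured by the reduction in Gini impurity. The function names `gini` and `split_gain` are hypothetical.

```python
from itertools import product


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def split_gain(rows, labels, var):
    """Reduction in Gini impurity from splitting on binary predictor `var`."""
    left = [y for x, y in zip(rows, labels) if x[var] == 0]
    right = [y for x, y in zip(rows, labels) if x[var] == 1]
    n = len(labels)
    children = (len(left) * gini(left) + len(right) * gini(right)) / n
    return gini(labels) - children


# Toy data: every combination of three binary predictors appears once.
# The outcome depends only on predictors 0 and 1, through a pure
# interaction (exclusive-or); predictor 2 is irrelevant.
rows = list(product((0, 1), repeat=3))
labels = [x[0] ^ x[1] for x in rows]

# Step 1 of a greedy search: score each predictor at the top node.
marginal_gains = [split_gain(rows, labels, v) for v in range(3)]
print("gains at the top node:", marginal_gains)

# After a split on predictor 0 has been made anyway, score predictor 1
# within the branch where predictor 0 equals 0.
branch = [(x, y) for x, y in zip(rows, labels) if x[0] == 0]
branch_rows = [x for x, _ in branch]
branch_labels = [y for _, y in branch]
conditional_gain = split_gain(branch_rows, branch_labels, 1)
print("gain for predictor 1 given predictor 0 == 0:", conditional_gain)
```

At the top node all three predictors score a gain of zero, so a greedy search has no reason to prefer the two interacting predictors over the irrelevant one. Once the first of them has been split on, the second separates the classes perfectly within the branch. A search that sometimes proposes splits with weak marginal effects can reach this second level, which a strictly greedy criterion gives no incentive to do.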
In the next section, we provide a detailed description of binary classification trees and the Bayesian treatment of model search, followed by illustrations of the approach through simulation studies and a real data example.