Corresponding author: Jason H. Moore, Ph.D., Computational Genetics Laboratory, Norris-Cotton Cancer Center, Dartmouth Medical School, 706 Rubin Bldg, HB7937, One Medical Center Dr., Lebanon, NH 03756. Tel: 603-653-9939; Fax: 603-653-9900; E-mail: email@example.com
A central goal of human genetics is to identify susceptibility genes for common human diseases. An important challenge is modelling gene–gene interaction or epistasis that can result in nonadditivity of genetic effects. The multifactor dimensionality reduction (MDR) method was developed as a machine learning alternative to parametric logistic regression for detecting interactions in the absence of significant marginal effects. The goal of MDR is to reduce the dimensionality inherent in modelling combinations of polymorphisms using a computational approach called constructive induction. Here, we propose a Robust Multifactor Dimensionality Reduction (RMDR) method that performs constructive induction using a Fisher's Exact Test rather than a predetermined threshold. The advantage of this approach is that only statistically significant genotype combinations are considered in the MDR analysis. We use simulation studies to demonstrate that this approach will increase the success rate of MDR when there are only a few genotype combinations that are significantly associated with case-control status. We show that there is no loss of success rate when this is not the case. We then apply the RMDR method to the detection of gene–gene interactions in genotype data from a population-based study of bladder cancer in New Hampshire.
Bladder cancer is the fifth most common cancer in the United States and is blamed for about 3% of all cancer deaths. Like many common diseases, the mechanism of bladder cancer is complex, involving many interrelated biochemical pathways under the influence of many genes. The interactions among multiple genetic factors, which are historically referred to as epistasis, are likely to play an important role in predicting an individual's susceptibility to bladder cancer. Given that interactions are likely to play an important role in the genetic architecture of disease susceptibility, we need to consider interactions of genetic variations in addition to their individual marginal effects. The problem is that interactions are much more difficult to detect and model than marginal effects because of nonadditivity in the genotype-to-phenotype mapping relationship and the curse of dimensionality that is inherent in considering the interaction effects of multiple polymorphisms. Addressing this challenge will require a combination of analysis strategies that come from both a bioinformatics and a biostatistics perspective (Cordell, 2009; Moore et al., 2010). We focus here on a bioinformatics approach to modelling gene–gene interactions that uses machine learning to embrace the complexity of the genetic architecture of common human diseases such as bladder cancer.
The goal of this study was to develop and evaluate a model-free and computationally efficient extension to MDR, Robust MDR or RMDR, to address the modelling limitation described above. We use two simulation studies to demonstrate the strengths of our proposed method. We also apply RMDR to a large, population-based study of bladder cancer from New Hampshire and demonstrate that it is able to identify new gene–gene interactions not previously described.
Methods and Materials
The MDR Algorithm
MDR is a constructive induction or attribute construction approach that seeks to identify combinations of multilocus genotypes that are associated with either high risk or low risk of disease. The general process of defining a new attribute as a function of two or more other attributes is referred to as constructive induction and was first described by Michalski (1983). Constructive induction using MDR was accomplished in this study in the following way (Fig. 1A):
1. For a given interaction order M, select M single nucleotide polymorphisms (SNPs) from the dataset.
2. Construct a contingency table using these M SNPs and calculate case-control ratios for each multilocus genotype.
3. Let T be the ratio of cases to controls in the whole dataset. For each multilocus genotype, if the ratio of cases to controls exceeds T, it is considered high-risk. Otherwise, it is considered low-risk. Once all genotypes are labelled “high-risk” and “low-risk,” a new binary attribute is created by pooling the “high-risk” genotype combinations into one group and the “low-risk” into another group.
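The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `mdr_attribute`, the input layout (one genotype tuple per sample, status coded 1 for case and 0 for control) and the string labels are assumptions made for the example.

```python
from collections import Counter

def mdr_attribute(genotypes, status):
    """Label each multilocus genotype 'high' or 'low' risk (MDR steps 2 and 3).

    genotypes: list of tuples, one multilocus genotype per sample (assumed layout).
    status:    list of 1 (case) / 0 (control), same length as genotypes.
    """
    cases = Counter(g for g, s in zip(genotypes, status) if s == 1)
    controls = Counter(g for g, s in zip(genotypes, status) if s == 0)
    n_cases, n_controls = sum(cases.values()), sum(controls.values())
    labels = {}
    for g in set(genotypes):
        a, b = cases[g], controls[g]
        # a / b > T with T = n_cases / n_controls, cross-multiplied so that
        # cells without controls do not cause a division by zero.
        labels[g] = "high" if a * n_controls > b * n_cases else "low"
    return labels

# Toy data: genotype (0, 0) is enriched for cases, (0, 1) for controls.
genotypes = [(0, 0)] * 4 + [(0, 1)] * 4
status = [1, 1, 1, 0, 0, 0, 0, 1]
labels = mdr_attribute(genotypes, status)
```

Ties (ratio exactly equal to T) fall into the low-risk group here because the text says only ratios that exceed T are high-risk.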
MDR uses a simple probabilistic classifier to model the relationship between attributes or variables constructed using MDR and case-control status that is similar to naïve Bayes (Hahn & Moore, 2004). Naïve Bayes classifiers were assessed using balanced accuracy as recommended by Velez et al. (2007). Balanced accuracy is defined as the arithmetic mean of sensitivity and specificity,
$$\text{Balanced accuracy} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right) \tag{1}$$
where TP are true positives, TN are true negatives, FP are false positives, and FN are false negatives. For each dataset MDR evaluates all possible M-way interactions and identifies the best model using balanced accuracy. MDR uses the following ten-fold cross-validation strategy to determine the M that gives the best overall model:
1. Divide the dataset into 10 parts. Use 9/10 of the data as the training set and the remaining 1/10 as the testing set.
2. Compute the training balanced accuracy for each M-way interaction in the training set.
3. Create an MDR attribute for the testing set using the M SNPs that have the largest training balanced accuracy.
4. Repeat the procedure 10 times so that each sample is included in the testing set once.
5. Compute the testing balanced accuracy using the new MDR attribute and the case-control status. For the M-way models that are chosen from the training set, record how many times they are identified as the best model (the cross-validation consistency).
The best MDR model was selected with maximum testing accuracy and maximum cross-validation consistency. The latter is used as a tie break. If both statistics are tied, then the more parsimonious model is chosen as the best model.
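The cross-validation procedure can be sketched as follows for a single interaction order M. This is a simplified illustration under several assumptions that are not taken from the text: folds are formed by interleaving sample indices (so the rows are assumed to be in random order), risk labels learned on the training folds are reused on the testing fold, genotypes unseen in training are treated as low risk, and the reported testing accuracy is averaged over the folds in which the winning model was selected. All function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

def cell(row, snps):
    """Multilocus genotype of one sample at the chosen SNP indices."""
    return tuple(row[i] for i in snps)

def label_cells(rows, status, snps):
    """Map each genotype to True (high risk) or False (low risk)."""
    cases, controls = Counter(), Counter()
    for row, s in zip(rows, status):
        (cases if s == 1 else controls)[cell(row, snps)] += 1
    n_case, n_ctrl = sum(cases.values()), sum(controls.values())
    return {g: cases[g] * n_ctrl > controls[g] * n_case
            for g in set(cases) | set(controls)}

def balanced_accuracy(rows, status, snps, high_risk):
    """Arithmetic mean of sensitivity and specificity."""
    tp = tn = fp = fn = 0
    for row, s in zip(rows, status):
        pred = high_risk.get(cell(row, snps), False)  # unseen cell: low risk (assumption)
        if s == 1:
            tp, fn = tp + pred, fn + (not pred)
        else:
            fp, tn = fp + pred, tn + (not pred)
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return (sens + spec) / 2

def mdr_cross_validate(rows, status, order, k=10):
    """Return (best SNP set, cross-validation consistency, mean testing accuracy)."""
    n_snps = len(rows[0])
    test_acc = defaultdict(list)
    for fold in range(k):
        pairs = list(enumerate(zip(rows, status)))
        tr_rows, tr_status = zip(*[p for i, p in pairs if i % k != fold])
        te_rows, te_status = zip(*[p for i, p in pairs if i % k == fold])

        def training_score(snps):
            return balanced_accuracy(tr_rows, tr_status, snps,
                                     label_cells(tr_rows, tr_status, snps))

        best = max(combinations(range(n_snps), order), key=training_score)
        labels = label_cells(tr_rows, tr_status, best)
        test_acc[best].append(balanced_accuracy(te_rows, te_status, best, labels))
    model = max(test_acc, key=lambda m: len(test_acc[m]))
    scores = test_acc[model]
    return model, len(scores), sum(scores) / len(scores)
```

Running this for each M and comparing the resulting testing accuracies and consistencies gives the model-selection step described in the text.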
A Robust MDR Algorithm
When there is an empty cell or the case-control ratio equals T (the ratio of cases to controls in the whole dataset), the MDR algorithm randomly assigns this genotype to the high- or low-risk group, resulting in a balanced accuracy of 0.5 for this cell. When the case-control ratio is close to T in one or more genotypes, MDR's balanced testing accuracy is only slightly over or under 0.5 in those cells. The inclusion of such genotype combinations in the constructive induction process negatively impacts the balanced accuracy of the overall model, thus influencing the MDR model fitting process. The goal of RMDR is to provide objective statistical criteria, using Fisher's Exact Test, for determining whether specific genotype combinations should be included in the overall MDR model, with the aim of making the final model more robust. Here, we add an “unknown risk group” for those genotype combinations with a case-control ratio equal or close to T. We then exclude the “unknown risk group” when we calculate the balanced accuracy of the MDR model. The detailed algorithm is described as follows (Fig. 1B):
1. For a given interaction order M, select M SNPs from the dataset.
2. Construct a contingency table using these M SNPs and calculate case-control ratios for each multilocus genotype.
3. For each multilocus genotype, let a be the number of cases and b be the number of controls. When T = 1, run Fisher's Exact Test on the 2 × 2 contingency table with rows (a, b) and (b, a). If the P-value is larger than α, assign the genotype to the “unknown risk group”; otherwise, if a > b it is considered high-risk, and if b > a it is considered low-risk. If T ≠ 1, Fisher's Exact Test is applied to […], where […] satisfy a1 + b1 = a + b and […].
4. Once all genotypes are labelled “high-risk”, “low-risk” and “unknown risk group”, a new three-level attribute is created by pooling the “high-risk” genotype combinations into one group, the “low-risk” into a second group and the “unknown risk group” into a third group.
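Step 3 can be illustrated for the balanced case (T = 1) with a small pure-Python sketch. It assumes a two-sided Fisher's Exact Test on the symmetric table with rows (a, b) and (b, a); that choice is an assumption, made because it reproduces the ten-sample example given for Table 1. The function names are hypothetical, and the T ≠ 1 case is not covered.

```python
from math import comb

def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher's Exact Test P-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)

def rmdr_label(a, b, alpha=0.1):
    """Three-level risk label for one genotype with a cases and b controls (T = 1 only)."""
    if fisher_two_sided(a, b, b, a) > alpha:
        return "unknown"
    return "high" if a > b else "low"

labels = [rmdr_label(8, 2), rmdr_label(7, 3), rmdr_label(2, 8), rmdr_label(0, 0)]
```

Empty cells and cells with equal numbers of cases and controls give a P-value of 1 and therefore land in the unknown group, so no random assignment is needed.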
In step 3, we add Fisher's Exact Test as a discriminant rule to determine whether the case-control ratio is close to T. Alternatives such as Pearson's χ² test could achieve the same goal. We chose Fisher's Exact Test mainly because it handles small sample sizes gracefully and offers an exact P-value based on the hypergeometric distribution. Since we are aiming to tackle high-order interactions, the numbers of cases and controls in most of the multilocus genotypes are likely to be small, which is the situation Fisher's Exact Test is designed for.
Table 1 shows an example of the discriminant rule set generated by Fisher's Exact Test for a significance level of 0.1 and a balanced number of cases and controls across different sample sizes. For example, when a total of 10 samples have a particular multilocus genotype combination, we label this combination high risk if there are more than seven cases. If there are fewer than three cases, we label it low risk. Otherwise, we label it as unknown.
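The ten-sample rule can be checked by hand. The worked arithmetic below assumes a two-sided Fisher's Exact Test on the symmetric table with rows (a, b) and (b, a), an assumption that reproduces the thresholds stated above; X denotes the upper-left cell count under the hypergeometric distribution with all margins equal to 10.

```latex
\begin{align*}
P(X = x) &= \frac{\binom{10}{x}\binom{10}{10 - x}}{\binom{20}{10}},
  \qquad \binom{20}{10} = 184756 \\[4pt]
a = 8,\ b = 2: \quad
P &= \frac{2\,(2025 + 100 + 1)}{184756} \approx 0.023 < 0.1
  \quad\Rightarrow\quad \text{high risk} \\[4pt]
a = 7,\ b = 3: \quad
P &= \frac{2\,(14400 + 2025 + 100 + 1)}{184756} \approx 0.179 > 0.1
  \quad\Rightarrow\quad \text{unknown}
\end{align*}
```

By symmetry, two or fewer cases out of ten give the same P-value as eight or more, which yields the low-risk rule.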
Table 1. Discriminant rule of Fisher's Exact Test for different numbers of samples (α = 0.1 and T = 1).
Number of samples in a genotype | High-risk group case number | Unknown group case number | Low-risk group case number
RMDR uses the same formula (equation 1) to define balanced accuracy, but is limited to samples in the high- or low-risk groups. An important concern with this approach is the coverage which is defined as the ratio of number of samples in the high- and low-risk groups to the total sample size. This is because, in some extreme cases, classification of all but one genotype combination into the “unknown risk group” could result in a balanced accuracy of one from a very small portion of the dataset. To address this concern, we propose to use
as a measurement to determine how well an RMDR attribute associates with case-control outcome. The cross-validation steps are similar to MDR's, with the exception that we use the score in equation 2 in place of balanced accuracy. Note that MDR is a special case of RMDR when we use α = 1 as the threshold for Fisher's Exact Test, because then Coverage = 1 and the RMDR score is the same as MDR's balanced accuracy. The RMDR approach has been made freely available as part of the open-source MDR software package, which is available for download from http://www.epistasis.org.
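The two ingredients discussed above, balanced accuracy restricted to the high- and low-risk groups and coverage, can be computed as in the sketch below. It deliberately stops short of combining them into the score of equation 2. The function name, the input dictionaries and the convention of returning 0 for an undefined sensitivity or specificity are assumptions made for illustration.

```python
def rmdr_accuracy_and_coverage(cell_labels, cell_counts):
    """Balanced accuracy over labelled cells only, plus coverage.

    cell_labels: genotype -> 'high' | 'low' | 'unknown' (assumed encoding).
    cell_counts: genotype -> (number of cases, number of controls).
    """
    tp = fp = tn = fn = total = 0
    for g, (a, b) in cell_counts.items():
        total += a + b
        if cell_labels[g] == "high":
            tp, fp = tp + a, fp + b      # cases correct, controls misclassified
        elif cell_labels[g] == "low":
            fn, tn = fn + a, tn + b      # controls correct, cases misclassified
        # 'unknown' cells are excluded from the accuracy calculation
    covered = tp + fp + tn + fn
    coverage = covered / total if total else 0.0
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return (sens + spec) / 2, coverage

counts = {"AA": (8, 2), "Aa": (5, 5), "aa": (2, 8)}
labels = {"AA": "high", "Aa": "unknown", "aa": "low"}
accuracy, coverage = rmdr_accuracy_and_coverage(labels, counts)
```

In this toy example two thirds of the samples sit in labelled cells, so a high accuracy is backed by most of the data; a coverage near zero would signal the extreme case the text warns about.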
The goal of the simulation study was to compare the success rate of RMDR with MDR for identifying true interaction models. We considered two different simulation scenarios. The goal of the first scenario was to generate data from a single gene–gene interaction model for which the majority of genotype combinations had case-control ratios consistent with the overall case-control ratio in the data. A secondary goal was to evaluate the influence of imbalanced numbers of cases and controls on the performance of RMDR. In the second more comprehensive scenario, we simulated datasets from the 70 different gene–gene interaction models described by Velez et al. (2007). In this scenario, our goal was to assess the performance of RMDR on a more general class of gene–gene interaction models. For both scenarios, we only considered two-way combinations of SNPs with both MDR and RMDR since we wanted to evaluate whether the new method was able to identify the true functional interacting pair of SNPs without the complication of overfitting.
In the first scenario, we simulated datasets based on a nonlinear gene–gene interaction model for which there are by design no marginal effects. The model we selected for illustration purposes is similar to model M186 described by Li & Reich (2000). We simulated 500 datasets from each of the penetrance functions described in Table 2. Here k is the prevalence of the disease and h is a function of the broad-sense heritability (H) and the case-control ratio (T). All loci have allele frequencies of 0.5 and genotype frequencies that are consistent with expected Hardy–Weinberg proportions. We simulated data with 200 cases and either 200, 400, or 600 controls. The imbalanced datasets were specifically generated to evaluate the impact of this feature on RMDR. Here, we used case-control ratios of T = (1, 0.5, 0.33). We picked small-to-moderate heritabilities of H = (0.01, 0.025, 0.05, 0.1). Each dataset consisted of two functional interacting SNPs and 18 independent nonfunctional SNPs. Both MDR and RMDR were applied to each dataset as described above. The success rate was estimated as the number of times MDR or RMDR correctly identified the two functional SNPs out of each set of 500 balanced and imbalanced datasets.
Table 2. Penetrance function for a two-locus epistatic model.
The values in parentheses show genotype frequencies that are consistent with expected Hardy-Weinberg proportions.
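A dataset of this shape can be generated with a short rejection-sampling sketch. The penetrance values used in the example call are hypothetical placeholders and are not the entries of Table 2; the function name, the genotype coding (0, 1 or 2 minor alleles) and the sampling scheme are likewise assumptions made for illustration.

```python
import random

def simulate_case_control(penetrance, n_cases, n_controls, n_noise=18, maf=0.5, seed=0):
    """Sample individuals until the requested numbers of cases and controls are reached.

    penetrance[g1][g2] is P(disease | genotypes g1, g2 at the two functional SNPs).
    Every entry must lie strictly between 0 and 1, otherwise the loop may not end.
    """
    rng = random.Random(seed)

    def genotype():
        # Two independent allele draws give Hardy-Weinberg genotype proportions.
        return (rng.random() < maf) + (rng.random() < maf)

    cases, controls = [], []
    while len(cases) < n_cases or len(controls) < n_controls:
        g1, g2 = genotype(), genotype()
        row = [g1, g2] + [genotype() for _ in range(n_noise)]
        if rng.random() < penetrance[g1][g2]:
            if len(cases) < n_cases:
                cases.append(row)
        elif len(controls) < n_controls:
            controls.append(row)
    return cases + controls, [1] * n_cases + [0] * n_controls

# Hypothetical penetrance table, for illustration only (NOT the values of Table 2).
toy_penetrance = [[0.1, 0.1, 0.1],
                  [0.1, 0.3, 0.1],
                  [0.1, 0.1, 0.1]]
rows, status = simulate_case_control(toy_penetrance, n_cases=200, n_controls=200)
```

The first two columns hold the functional SNPs and the remaining 18 are independent noise SNPs, matching the 20-SNP layout described for scenario I.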
In the second scenario we wanted to evaluate RMDR on a more comprehensive and more general set of gene–gene interaction models. Here, we used a set of 70 different two-locus penetrance functions that were previously developed by Velez et al. (2007) to evaluate MDR-based methods. The models were distributed evenly across seven broad-sense heritabilities (0.01, 0.025, 0.05, 0.1, 0.2, 0.3, and 0.4) and two different minor allele frequencies (0.2 and 0.4). Five models were generated for each of the 14 heritability-allele frequency combinations, for a total of 70 models. Genotype frequencies for all 70 epistasis models were consistent with Hardy–Weinberg proportions. A total of 500 datasets were generated for each heritability-allele frequency combination, with 100 per model. In this study we used a case-control ratio of T = 1 since the results from scenario I and previous results from Velez et al. (2007) suggest that imbalanced data do not influence MDR or RMDR results when balanced accuracy is used. Each pair of functional polymorphisms was embedded within a set of 20 independent SNPs. A total of 7000 datasets were generated and analysed. We estimated success rate as the number of times MDR or RMDR correctly identified the two functional SNPs out of each set of 100 datasets, then averaged these results over the five models with the same heritability-allele frequency combination.
Real Data Analysis
We demonstrated the RMDR method with real data by applying it to a genetic epidemiology study that examined the relationship between DNA repair gene SNPs, smoking, and bladder cancer susceptibility and that was previously analysed using MDR and a 1000-fold permutation test (Andrew et al., 2006). The study analysed 355 bladder cancer cases and 559 controls ascertained from the state of New Hampshire and focused specifically on genes that play an important role in the repair of DNA sequences that have been damaged by chemical compounds (e.g., carcinogens). Seven SNPs were measured, including two from the X-ray repair cross-complementing group 1 gene (XRCC1), one from the XRCC3 gene, two from the xeroderma pigmentosum group D (XPD) gene (now known as ERCC2), one from the nucleotide excision repair gene (XPC), and one from the apurinic/apyrimidinic endonuclease 1 gene (APE1). Each of these genes plays an important role in DNA repair. Smoking is a known risk factor for bladder cancer and was included in the analysis along with gender and age for a total of 10 attributes. Age was measured as > or ≤50 years. Pack years of smoking (Pkyr1) was measured as never smoked, smoked <35 pack-years, or smoked ≥35 pack-years.
A parametric linear statistical analysis of each individual attribute revealed that none of the measured SNPs were significant predictors of bladder cancer individually (P > 0.05). Andrew et al. (2006) used MDR to exhaustively evaluate all possible two-, three-, and four-way interactions among the genetic and environmental attributes. For each combination of attributes a single constructed attribute was evaluated using a naïve Bayes classifier. Training and testing accuracy were estimated using 10-fold cross-validation. A best model was selected that maximized the testing accuracy. The best model had a testing accuracy of approximately 0.63 and included two SNPs from the XPD gene and smoking. They statistically evaluated this model with a 1000-fold permutation test and determined these results to be highly significant (P < 0.001). Post-hoc analysis of the MDR model using entropy-based measures of interaction information revealed that the two XPD polymorphisms had evidence of nonlinear interaction or synergy in the near complete absence of marginal effects. Interestingly, the joint effect of the two XPD SNPs was larger than the independent effect from smoking. The goal of this study was to determine whether RMDR finds the same best model, and, if so, to compare how the MDR and RMDR models differ.
The success rate of RMDR and MDR to detect the true SNP pairs for the M186 model from scenario I is presented in Table 3. For RMDR, we used a more liberal significance threshold of α = 0.1 and a standard significance threshold of α = 0.05. When the broad-sense heritability or genetic effect size is low (e.g., 0.01 or 0.025), RMDR has nearly double the success rate of MDR in most cases across different case-control ratios (T). The more conservative significance criterion of 0.05 yielded a slightly better success rate for RMDR. RMDR and MDR performed similarly at the moderate heritability of 0.1. As described in the methods, the increased success rate of RMDR observed at the lower heritabilities can be attributed to the simulation model chosen for scenario I. Here, the number of genotype combinations where the ratio of cases to controls was equal to that of the overall dataset exceeded the number of genotype combinations that accounted for the epistasis effect. Under this specific scenario we expected RMDR to outperform MDR, and this is what we observed in the simulation for the lower heritabilities.
Table 3. Success rate of RMDR and MDR to detect the correct two-locus model in scenario I.
T= 1, n= 400
T= 0.5, n= 600
T= 0.33, n= 800
The success rate of RMDR and MDR to detect the true SNP pairs across the 70 two-locus epistasis models from scenario II is presented in Table 4. There was no appreciable difference in success rate between MDR and RMDR across different heritabilities and allele frequencies. These results document that RMDR has a good success rate for detecting gene–gene interactions across a wide range of models that were not selected to have many genotype combinations where the ratio of cases to controls was equal to that of the overall dataset. This confirms that RMDR is generally applicable, as MDR is, to detecting gene–gene interactions in the absence of marginal effects.
Table 4. Success rate of RMDR and MDR to detect the correct two-locus model in scenario II.
MAF = 0.4
MAF = 0.2
Real Data Analysis Results
We applied RMDR with a threshold α= 0.05 to the bladder cancer study described above and summarize the top one-way, two-way, three-way, and four-way models in Table 5. Here Pkyr1 is a discrete variable with three levels representing increasing pack years of smoking. As reported by Andrew et al. (2006) with MDR, RMDR found that the best model consisted of Pkyr1 and two XPD polymorphisms. We verified the statistical significance of this model using a 1000-fold permutation test and found that this best model had a P-value < 0.001. In other words, an RMDR model with a score as good as or better than our best model was never found in 1000 permutations of the data. The level of significance is consistent with that observed by Andrew et al. (2006).
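The permutation test can be sketched generically as follows. Here `score_fn` is a hypothetical stand-in for the full analysis (in practice it would rerun the complete RMDR model search on each permuted dataset and return the best score); the function name, signature and fixed random seed are assumptions made for illustration.

```python
import random

def permutation_p_value(score_fn, data, status, n_perm=1000, seed=0):
    """Empirical P-value: fraction of label permutations scoring at least as well.

    score_fn(data, status) must return the statistic of the best model found
    for that labelling (assumed interface).
    """
    rng = random.Random(seed)
    observed = score_fn(data, status)
    shuffled = list(status)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any genotype-phenotype association
        if score_fn(data, shuffled) >= observed:
            hits += 1
    return observed, hits / n_perm

# Toy statistic: fraction of samples whose single attribute equals the status.
def agreement(data, status):
    return sum(d == s for d, s in zip(data, status)) / len(status)

data = [0] * 50 + [1] * 50
observed, p = permutation_p_value(agreement, data, list(data))
```

When no permuted dataset reaches the observed score, the empirical fraction is zero and the result is reported as P < 1/n_perm, which corresponds to the P < 0.001 quoted for 1000 permutations.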
Table 5. Top one- to four-way models identified by RMDR (α = 0.05) for the bladder cancer study.
xpd.751 xpd.312 Pkyr1
xpd.751 xpd.312 xpcinsrtb Pkyr1
Figure 2A summarizes the original MDR model published by Andrew et al. (2006) while Figure 2B summarizes the new RMDR model. Note the greatly reduced number of genotype-smoking combinations that are labelled high risk or low risk by RMDR compared to MDR. In fact, the number of cells assigned a risk has decreased from 27 to 13. It is much easier to see the marginal effect of smoking since all the cells with Pkyr1 = 2 (smoked ≥ 35 pack-years) are now labelled high risk. It is also easier to see the genotype-dependent effects that comprise the synergistic gene–gene interaction that was described in detail by Andrew et al. (2006).
The goal of this study was to develop a simple method for implementing the MDR method for modelling gene–gene interactions in genetic association studies that takes into account the ratio of cases to controls for each genotype combination. We introduced here a robust MDR or RMDR method that uses Fisher's Exact Test to assess the statistical significance of each multilocus cell for its association with case-control status before including it in an MDR model. We used two different simulation scenarios to demonstrate RMDR's ability to identify complex associations between genotype and phenotype. In the first simulation, RMDR had a clear advantage over MDR when only four out of nine genotype combinations from a two-way interaction model were associated with case-control status. In the second set of simulations, we demonstrated that RMDR and MDR perform similarly across a wide range of different gene–gene interaction models, confirming that RMDR retains a good success rate even when most of the genotype combinations contribute to the association. Finally, we applied RMDR to a genetic association study of bladder cancer susceptibility that had been previously analysed using MDR (Andrew et al., 2006). The RMDR analysis confirmed the previous finding of MDR at a similar significance level but resulted in a much simpler model that reduced the number of genotype combinations from 27 to only 13, thus making this model easier to interpret. We see the improved simplicity of MDR models as a clear advantage of the RMDR approach using Fisher's Exact Test.
Despite these advantages, it is important to point out some disadvantages of this approach. Running Fisher's Exact Test for every genotype combination results in a significant computational burden. For example, MDR takes less than one second to search over all one-way, two-way and three-way models in a dataset with 20 SNPs and a sample size of 400, whereas RMDR takes about three seconds to do the same job. We expect that RMDR will take much more time than MDR when focusing on higher order combinations. On the other hand, the extra computational burden can be significantly reduced if we precalculate the discriminant rule (Table 1) for all possible sample sizes in any genotype. For instance, if we have a dataset with a sample size of 1000, we could run Fisher's Exact Test 1000 times to get the discriminant rule. We could store the results in a lookup table and then apply it to any genotype combination. Precomputing the discriminant rules will bring the computational burden of RMDR to a level very close to that of MDR. This will be important for genome-wide association studies, for example.
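The lookup-table idea can be sketched for the balanced case (T = 1). The sketch assumes a two-sided Fisher's Exact Test on the symmetric table with rows (a, b) and (b, a), and it scans the case counts for each cell size rather than running exactly one test per size; the function names and the table layout are hypothetical.

```python
from math import comb

def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher's Exact Test P-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    return min(1.0, sum(prob(x) for x in range(lo, hi + 1)
                        if prob(x) <= p_obs * (1 + 1e-9)))

def build_rule_table(max_n, alpha=0.1):
    """Map cell size n -> (largest low-risk case count, smallest high-risk case count).

    The entry is None when no split of n samples is significant at level alpha.
    """
    rules = {}
    for n in range(1, max_n + 1):
        high_min = next((a for a in range(n // 2 + 1, n + 1)
                         if fisher_two_sided(a, n - a, n - a, a) <= alpha), None)
        rules[n] = None if high_min is None else (n - high_min, high_min)
    return rules

def lookup_label(rules, a, b):
    """Label a cell with a cases and b controls using only the precomputed table."""
    rule = rules.get(a + b)
    if rule is None:
        return "unknown"
    low_max, high_min = rule
    if a >= high_min:
        return "high"
    if a <= low_max:
        return "low"
    return "unknown"

rules = build_rule_table(20, alpha=0.1)
```

After the table is built once, labelling a genotype combination is a dictionary lookup and two comparisons, which is why the cost approaches that of MDR.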
Future studies with RMDR will investigate measures of entropy as an approach for eliminating multilocus genotype combinations with case-control ratios close to that of the whole dataset. We have previously explored the use of interaction information as a measure of gene–gene interaction (Moore et al., 2006). Bush et al. (2008) have previously used entropy-based contingency table measures to assess overall association of MDR models. However, these measures have yet to be applied to the problem that RMDR addresses. An advantage that entropy might have over Fisher's Exact Test is computational efficiency. A disadvantage is that it typically does not provide a measure of statistical significance. These benefits and potential drawbacks will be explored and compared within the context of RMDR. It is clear that additional extensions and modifications of the MDR approach will be needed to fully address the complexity of genetic association data and the mapping relationship between genotype and phenotype.
This work was funded by grant #IRG-82-003-22 from the American Cancer Society and NIH grants LM009012, LM010098, AI59694, CA57494 and ES007373.