Keywords:

  • Epistasis;
  • case-control study;
  • logistic regression;
  • set association method;
  • logic regression;
  • MDR

Summary

To identify interacting loci in genetic epidemiological studies the application of multi-locus methods of analysis is warranted. Several more advanced classification methods have been developed in the past years, including multiple logistic regression, sum statistics, logic regression, and the multifactor dimensionality reduction method. The objective of our study was to apply these four multi-locus methods to simulated case-control datasets that included a variety of underlying statistical two-locus interaction models, in order to compare the methods and evaluate their strengths and weaknesses. The results showed that the ability to identify the interacting loci was generally good for the sum statistic method, the logic regression and MDR. The performance of the logistic regression was more dependent on the underlying model and multiple comparison adjustment procedure. However, identification of the interacting loci in a model with two two-locus interactions of common disease alleles with relatively small effects was impaired in all methods. Several practical and methodological issues that can be considered in the application of these methods, and that may warrant further research, are identified and discussed.


Introduction

Complex diseases or traits result from the interplay between and among genes and environmental factors. One way in which genetic factors can act together in relation to a certain trait is epistatically. Generally speaking, epistasis refers to the phenomenon that the relation between a gene and a trait outcome depends on another gene. More specifically, from a statistical point of view, epistasis or gene-gene interaction has been defined as the presence of non-additivity in a mathematical model that describes the relation between genetic variants and a trait in a population (Cordell et al. 2001, 2002).

The ability of multi-locus methods of analysis to detect and incorporate statistical gene-gene interaction in genetic association studies is likely to favour the chance of retrieving causal genes and a better prediction of trait outcome based on genotype data (Cordell et al. 2001; Culverhouse et al. 2002). Researchers that want to identify interacting loci in their genetic association studies through application of multi-locus techniques can nowadays choose from a variety of methods and software packages that are readily available, in addition to traditional multiple regression analysis. The goal of our study was to apply four available multi-locus classification techniques, namely multiple logistic regression, the set association method as implemented in the Sumstat software, the multifactor dimensionality reduction (MDR), and logic regression, on simulated data to identify strengths and weaknesses in the application of these methods. More specifically, we have focussed on their ability to identify causal loci that may be statistically interacting according to some specific underlying model. This will give researchers more insight into the performance of the methods and allow more informed choices about when and how to apply them.

Many multi-locus methods that can be applied to investigate gene-gene interactions have been introduced in the past decade, as reviewed by Hoh & Ott (2003) and Thornton-Wells et al. (2004). A number of these methods are particularly suitable for genetic case-control association studies, and for some of those, freely available software has been introduced. The MDR software was introduced in 2003 and was especially developed to analyse high-order joint effects of loci (and environment) in genetic association case-control and sib-pair study designs (Hahn et al. 2003). The general idea behind the non-parametric and genetic model-free MDR approach is that it reduces the dimensionality of the multi-locus data by pooling those combinations of genotypes that can be defined as high-risk and those that can be defined as low-risk, based on the case-control ratio for the specific multi-locus genotype. The reduction of the dimensionality of the data overcomes the problem of a low number of observations in high-order data combinations. An exhaustive search over all possible high-order genotype combinations for a varying number of loci is performed, and the best combination of single nucleotide polymorphisms (SNPs) for a certain model size is chosen based on classification error. Using cross-validation, the ability of the new one-dimensional multi-locus variable to predict case-control status, and in addition its cross-validation consistency, is determined. The locus or combination of loci that has the best predictive value and highest cross-validation consistency is considered to be the outcome of interest. An empirical p-value for the testing accuracy and cross-validation consistency of the final selected model can be determined using a permutation testing procedure. The MDR method has successfully identified interacting loci in several real data applications (Brassat et al. 2006; Cho et al. 2004; Coffey et al. 2004; Ma et al. 2005; Ritchie et al. 2001; Williams et al. 2004), and its performance has been evaluated on simulated data with a focus on a number of epistasis models that show no or small single locus main effects (Ritchie et al. 2001, 2003).
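The pooling step described above can be illustrated with a minimal sketch. It is not the code of the MDR software; the function names, the toy data and the way ties are handled (a cell with a case/control ratio equal to the threshold is labelled high-risk) are assumptions made for illustration, and the exhaustive search and cross-validation steps are left out.

```python
from collections import Counter

def mdr_classify(genotypes, status, threshold=1.0):
    """Pool multi-locus genotype cells into high-risk (1) or low-risk (0).

    genotypes: list of tuples, one multi-locus genotype per subject
    status:    list of 0/1 labels (1 = case), same length as genotypes
    threshold: case/control ratio at or above which a cell is high-risk
    """
    cases = Counter(g for g, s in zip(genotypes, status) if s == 1)
    controls = Counter(g for g, s in zip(genotypes, status) if s == 0)
    labels = {}
    for cell in set(cases) | set(controls):
        n_case, n_ctrl = cases[cell], controls[cell]
        # Compare counts instead of dividing, so empty control cells are safe;
        # with threshold 1.0 a tied cell is labelled high-risk.
        labels[cell] = 1 if n_case >= threshold * n_ctrl else 0
    return labels

def classification_error(labels, genotypes, status):
    """Fraction of subjects whose pooled label disagrees with their status."""
    wrong = sum(labels.get(g, 0) != s for g, s in zip(genotypes, status))
    return wrong / len(status)

# Toy two-locus data (hypothetical): seven subjects in three genotype cells.
genos = [(0, 1), (0, 1), (0, 1), (1, 1), (1, 1), (2, 0), (2, 0)]
stat = [1, 1, 0, 0, 0, 1, 0]
labels = mdr_classify(genos, stat)
err = classification_error(labels, genos, stat)
```

In a full analysis this error would be computed on held-out folds for every candidate combination of loci, which is what makes the search exhaustive.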

Logic regression is a generalized regression methodology that was introduced in 2001 and was also mainly developed to study high-order interactions in genetic studies (Kooperberg et al. 2001; Ruczinski et al. 2004). It has been implemented in freely available software as an R and S-Plus package. In short, the goal of logic regression is to find Boolean combinations of binary predictors, for example SNPs, that are associated with the outcome of interest and that are incorporated into a regression model. The combinations of the binary predictors are efficiently organized in a tree form. The performance is evaluated by comparing the fitted values and response using a scoring function that is dependent on the type of regression. So instead of modelling the interaction between several SNPs using a high number of model parameters in a regression model, a combination can be captured through one tree parameter using Boolean operators; SNPs that are included in the same regression tree instead of in separate trees may be acting in a non-additive way on the scale of interest. The building and selection of the logic trees can be done using a stochastic simulated annealing algorithm, and the size and number of trees can vary. Model selection can be guided by comparing the predictive values of the models, based on cross-validation or permutation testing procedures. The latest software version also includes a Monte Carlo regression option that can identify several best tree models that are related to trait outcome and that may fit the data equally well (Kooperberg & Ruczinski, 2005). The method has been applied to a post PTCA restenosis dataset and has also been successfully used in the analysis of simulated genetic data (Kooperberg et al. 2001; Ruczinski et al. 2004).
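To make the idea of a single tree parameter concrete, the sketch below evaluates one Boolean tree on binary predictors. The tuple representation, the predictor names and the example tree are hypothetical and are not taken from the logic regression package; the tree search by simulated annealing is not shown.

```python
def evaluate(tree, x):
    """Evaluate a Boolean expression tree on a dict of binary predictors.

    A tree is a predictor name (a leaf), ("not", subtree),
    ("and", left, right) or ("or", left, right).
    """
    if isinstance(tree, str):
        return bool(x[tree])
    op = tree[0]
    if op == "not":
        return not evaluate(tree[1], x)
    if op == "and":
        return evaluate(tree[1], x) and evaluate(tree[2], x)
    if op == "or":
        return evaluate(tree[1], x) or evaluate(tree[2], x)
    raise ValueError(f"unknown operator: {op}")

# Hypothetical four-leaf tree: heterozygous at one SNP but not at both.
tree = ("or",
        ("and", "snp1_het", ("not", "snp2_het")),
        ("and", ("not", "snp1_het"), "snp2_het"))

one_het = int(evaluate(tree, {"snp1_het": 1, "snp2_het": 0}))
both_het = int(evaluate(tree, {"snp1_het": 1, "snp2_het": 1}))
```

The 0/1 value of the tree would then enter the regression model as one covariate, so a non-additive combination of two SNPs costs a single coefficient.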

The set-association method, as implemented in the Sumstat software, introduced in 2001 by Hoh et al. (2001), is a non-parametric method that uses sum statistics to evaluate the joint effect of loci related to trait outcome. The locus or set of loci associated with the trait of interest is identified by creating and testing sum statistics that capture the combined information from multiple SNPs. Based on single locus test statistics, for instance Chi-square values from 2 by 3 tables, Chi-square values for deviation from Hardy Weinberg equilibrium, or Chi-squares for pair-wise interactions, SNPs are selected and added to the sum statistic that is calculated for an increasing number of loci. The SNPs are chosen based on the value of the test statistic and are added sequentially to the sum statistic in order of decreasing value. The significance of each sum is evaluated using a permutation procedure. Then, the smallest empirical p-value for the sum statistic is in its turn evaluated for global significance via permutation tests, thereby correcting for the testing of multiple sums. So the goal is to find the subset of loci that is most significantly associated with the trait of interest. This method has been evaluated for its power by use of simulated data without a focus on interaction (Kim et al. 2003; Wille et al. 2003), and has been applied in several case-control genetic association studies (de Quervain et al. 2004; Hoh et al. 2001; Maitland-van der Zee et al. 2005).
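The construction of the sum statistics can be sketched as follows. This is an illustration under simplifying assumptions, not the Sumstat implementation: only the 2 by 3 Chi-square is used as the per-marker statistic, the function names are hypothetical, and the second permutation level that evaluates the smallest p-value for global significance is omitted.

```python
import random

def chi_square_2x3(genotype, status):
    """Pearson chi-square for a case-control by genotype (0/1/2) table."""
    n = len(status)
    stat = 0.0
    for s in (0, 1):
        row = sum(1 for v in status if v == s)
        for g in (0, 1, 2):
            col = sum(1 for v in genotype if v == g)
            expected = row * col / n
            if expected > 0:  # skip empty genotype columns
                observed = sum(1 for gv, sv in zip(genotype, status)
                               if gv == g and sv == s)
                stat += (observed - expected) ** 2 / expected
    return stat

def sum_statistics(markers, status, max_terms):
    """Cumulative sums of the largest marker statistics, in decreasing order."""
    stats = sorted((chi_square_2x3(m, status) for m in markers), reverse=True)
    sums, total = [], 0.0
    for value in stats[:max_terms]:
        total += value
        sums.append(total)
    return sums

def empirical_p_values(markers, status, max_terms, n_perm, seed=0):
    """Permutation p-value per sum: labels are shuffled, genotypes kept fixed."""
    rng = random.Random(seed)
    observed = sum_statistics(markers, status, max_terms)
    exceed = [0] * len(observed)
    shuffled = list(status)
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        permuted = sum_statistics(markers, shuffled, max_terms)
        for k, (perm, obs) in enumerate(zip(permuted, observed)):
            if perm >= obs:
                exceed[k] += 1
    return [count / n_perm for count in exceed]

# Toy data (hypothetical): two markers, six subjects.
markers = [[0, 0, 2, 2, 1, 1], [0, 1, 2, 0, 1, 2]]
status = [0, 0, 1, 1, 0, 1]
sums = sum_statistics(markers, status, max_terms=2)
p_values = empirical_p_values(markers, status, max_terms=2, n_perm=50)
```

Because the statistics are re-sorted within every permutation, each permuted sum is compared with the observed sum of the same size, which mirrors the sequential construction described above.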

Alongside several advantages, such as transparency, familiarity, and the estimation of interpretable effect parameters, the traditional multiple logistic regression technique has known disadvantages in identifying interacting loci in case-control association studies. Sparse data can easily become a problem in studies with relatively small sample sizes when including (high-order) interaction parameters. Furthermore, the number of tests can be very high when the number of SNPs included in the study, and the order of interactions to evaluate, becomes large. This can all lead to unreliable parameter estimates, inflated type I errors, and low power to detect the associated loci. Methods to deal with multiple comparisons, like the Bonferroni correction, the False Discovery Rate (FDR), and randomization procedures, have been introduced as solutions to mitigate elevated type I error rates. Another difference from the previously discussed techniques is that logistic regression is a parametric, model-based method that requires enumeration of the (interaction) model(s) to analyse; the available choice of model definitions that can be used in logistic regression to investigate gene-gene interactions is large. Logistic regression has successfully identified interacting loci in the past in studies that included a relatively small number of loci, and it was recently shown that it is also feasible to successfully apply the technique to identify interacting loci in genome-wide association studies (Marchini et al. 2005).

The general interpretation of statistical interaction as deviation from additivity makes its presence dependent on the scale that is used (i.e. logit, or penetrance). The multiplicative model on a penetrance scale is often considered to represent the standard statistical interaction model; this model approximates an additive model on the log odds scale, implicit in the standard logistic regression model commonly used in the analysis of case-control studies. However, other statistical models besides the multiplicative also meet the criterion of non-additivity. In complex diseases the statistical interaction present in the collected population data is not known beforehand. It will then be important to know if the applied multi-locus methods can detect the interacting loci for different underlying interaction models. Therefore, we evaluated the performance of the above-mentioned multi-locus methods for a selection of two-locus statistical interaction models.
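The scale dependence can be made explicit with a short worked relation. The symbols \alpha_i and \beta_j for the single-locus contributions are notation assumed here for illustration; f_{ij} denotes the two-locus penetrance for i disease alleles at the first locus and j at the second.

```latex
f_{ij} = \alpha_i \, \beta_j
\quad\Longrightarrow\quad
\log f_{ij} = \log \alpha_i + \log \beta_j ,
\qquad
\operatorname{logit}(f_{ij}) = \log f_{ij} - \log(1 - f_{ij}) \approx \log f_{ij}
\quad \text{when } f_{ij} \ll 1 .
```

A model that is multiplicative on the penetrance scale is therefore exactly additive on the log scale, and approximately additive on the log odds scale as long as the penetrances are small.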

Materials and Methods

Statistical Interaction Models

We used the penetrance scale to define two-locus interaction models. The penetrance is the probability of being affected given the genotype at, in our case, both loci, and it can range from 0 (no penetrance) to 1 (full penetrance). We simulated 5 two-locus interaction models with reduced penetrance based on the classification schemes described by Li & Reich (2000) and those models described in earlier papers on epistasis (Figure 1). They include a ‘multiplicative model’ (Mod1) and a ‘heterogeneity based model’ (Mod2). The multiplicative model is characterized by multi-locus penetrances that result from the product of the single locus contributions. The heterogeneity model is often considered a non-interactive model, in which both loci increase disease risk independent of the genotype at the other locus, but it can also be viewed as a non-additive and therefore interactive model, as stated by Cordell (2002).

Figure 1. Two-locus penetrances for the models used for data simulation.

Furthermore, we simulated a ‘conditional dominant or recessive model’ (Mod3), where one locus portrays dominant or recessive behaviour dependent on the genotype at the other locus. We also included an ‘exclusive OR’ (Mod4) and ‘missing lethal genotype’ (Mod5) based model. The first reflects the situation in which the risk of disease is elevated if a subject is heterozygous for one locus but not both. In the second model the disease risk is elevated when exactly two disease alleles are present, the risk being higher when both are at the same locus. These last two models are special because they show no main effects of the single loci in case of a disease allele frequency of 0.5.

We also simulated a model in which two two-locus interactions, both based on the ‘missing lethal genotype’ model (Mod5), increased disease risk according to a heterogeneity model (Mod6); the interaction was considered present if either one or the other or both sets of criteria for two-locus interactions were satisfied. The allele frequencies of both markers in both interactions were set to be equal. Note that this model also did not display any marginal single locus effects for the 0.5 frequency. Finally, a null model (ModNull) was generated in which the disease status of 200 out of the 400 randomly selected subjects was set to case.

Data Simulations

The data simulations were performed in Stata Version 8. We simulated the models by defining the penetrance and frequency for all two-locus genotype combinations for the variety of underlying disease models (Figure 1). We assumed that the two bi-allelic polymorphisms were both in Hardy Weinberg Equilibrium and in linkage equilibrium. Four different allele frequency models were generated, where both loci had an expected frequency of 0.5 (f0.5), 0.3 (f0.3), 0.1 (f0.1) or 0.05 (f0.05). In addition to the two causal SNPs eight non-causal SNPs were generated. Three of these had a frequency of 0.5, three of 0.3 and two had an allele frequency of 0.1.
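The genotype generation can be sketched in a few lines. The original simulations were run in Stata, so this Python version is only an illustration of the stated assumptions: drawing two independent alleles per locus yields Hardy Weinberg proportions, and drawing loci independently yields linkage equilibrium. The function name and the seed are hypothetical.

```python
import random

def simulate_genotypes(n_subjects, allele_freqs, seed=0):
    """Return one row per subject with 0, 1 or 2 minor alleles per SNP."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_subjects):
        # Two independent allele draws per locus; loci are drawn independently.
        row = [(rng.random() < q) + (rng.random() < q) for q in allele_freqs]
        data.append(row)
    return data

# Two causal SNPs at frequency 0.5 followed by the eight non-causal SNPs.
freqs = [0.5, 0.5] + [0.5] * 3 + [0.3] * 3 + [0.1] * 2
genotypes = simulate_genotypes(20000, freqs)
```

Disease status would then be assigned per subject from the two-locus penetrance of the first two columns.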

To generate one simulated sample, 100 cases based on a high-risk genotype combination and 300 controls were randomly selected out of the complete simulated dataset with 150000 observations. To obtain a more realistic representation of a complex disease model, we introduced 50% phenocopies into every model by randomly changing the disease status of 100 controls to case status thus creating datasets containing 200 cases and 200 controls. By doing this, we forced the disease prevalence of the population from which the final sample was selected to be twice that of the original simulations, since the 100 original true genetic cases now represented 50% of the total number of cases. As a result, the relative effect of the low frequency causal alleles was higher than that of the more frequent causal alleles. The final expected penetrances, the marginal penetrances and corresponding population prevalence of the disease are depicted in Table 1. For each combination of disease and allele frequency 100 replicates were simulated, which resulted in a total of 2800 datasets.
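The sampling and phenocopy step can be illustrated as follows. This is a sketch under the assumption that the simulated population is held as a list of (genotype, affected) pairs; the function name, argument names and toy population are hypothetical.

```python
import random

def draw_sample(population, n_cases=100, n_controls=300,
                n_phenocopies=100, seed=0):
    """Draw cases and controls, then relabel some controls as phenocopies.

    population: list of (genotype, affected) pairs
    returns:    list of (genotype, status) pairs with status 1 = case
    """
    rng = random.Random(seed)
    affected = [g for g, a in population if a]
    unaffected = [g for g, a in population if not a]
    cases = rng.sample(affected, n_cases)        # raises ValueError if too few
    controls = rng.sample(unaffected, n_controls)
    # rng.sample returns items in random order, so the first n_phenocopies
    # controls form a random subset to be relabelled as cases.
    phenocopies = controls[:n_phenocopies]
    remaining = controls[n_phenocopies:]
    return ([(g, 1) for g in cases + phenocopies]
            + [(g, 0) for g in remaining])

# Toy population (hypothetical): 1000 subjects, 200 of them affected.
population = [((i % 3, i % 2), i < 200) for i in range(1000)]
sample = draw_sample(population)
n_case = sum(status for _, status in sample)
```

With the default arguments the result contains 200 cases and 200 controls, half of the cases being phenocopies.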

Table 1.  Population genotypic penetrances for the simulated case-control data after introduction of the phenocopies
                                                                   Marginal penetrance³
Model  MAF¹  f00²  f01   f02   f10   f11   f12   f20   f21   f22   f0.   f1.   f2.   K⁴
1      0.5   0.06  0.06  0.06  0.06  0.15  0.15  0.06  0.15  0.15  0.06  0.13  0.13  0.11
1      0.3   0.03  0.03  0.03  0.03  0.12  0.12  0.03  0.12  0.12  0.03  0.08  0.08  0.05
1      0.1   0.00  0.00  0.00  0.00  0.10  0.10  0.00  0.10  0.10  0.00  0.02  0.02  0.01
1      0.05  0.00  0.00  0.00  0.00  0.10  0.10  0.00  0.10  0.10  0.00  0.01  0.01  0.00
2      0.5   0.05  0.05  0.14  0.05  0.05  0.14  0.14  0.14  0.14  0.07  0.07  0.14  0.09
2      0.3   0.02  0.02  0.12  0.02  0.02  0.12  0.12  0.12  0.12  0.03  0.03  0.12  0.03
2      0.1   0.00  0.00  0.10  0.00  0.00  0.10  0.10  0.10  0.10  0.00  0.00  0.10  0.00
2      0.05  0.00  0.00  0.10  0.00  0.00  0.10  0.10  0.10  0.10  0.00  0.00  0.10  0.00
3      0.5   0.03  0.03  0.03  0.03  0.03  0.13  0.03  0.13  0.13  0.03  0.06  0.10  0.06
3      0.3   0.01  0.01  0.01  0.01  0.01  0.11  0.01  0.11  0.11  0.01  0.02  0.06  0.02
3      0.1   0.00  0.00  0.00  0.00  0.00  0.10  0.00  0.10  0.10  0.00  0.00  0.02  0.00
3      0.05  0.00  0.00  0.00  0.00  0.00  0.10  0.00  0.10  0.10  0.00  0.00  0.01  0.00
4      0.5   0.05  0.15  0.05  0.15  0.05  0.15  0.05  0.15  0.05  0.10  0.10  0.10  0.10
4      0.3   0.05  0.15  0.05  0.15  0.05  0.15  0.05  0.15  0.05  0.09  0.11  0.09  0.10
4      0.1   0.03  0.13  0.03  0.13  0.03  0.13  0.03  0.13  0.03  0.05  0.11  0.05  0.06
4      0.05  0.02  0.12  0.02  0.12  0.02  0.12  0.02  0.12  0.02  0.03  0.11  0.03  0.03
5      0.5   0.03  0.03  0.12  0.03  0.07  0.03  0.12  0.03  0.03  0.05  0.05  0.05  0.05
5      0.3   0.02  0.02  0.12  0.02  0.07  0.02  0.12  0.02  0.02  0.03  0.04  0.07  0.04
5      0.1   0.00  0.00  0.10  0.00  0.05  0.00  0.10  0.00  0.00  0.00  0.01  0.08  0.01
5      0.05  0.00  0.00  0.10  0.00  0.05  0.00  0.10  0.00  0.00  0.00  0.01  0.09  0.00
6⁵     0.5   0.06  0.06  0.12  0.06  0.09  0.06  0.12  0.06  0.06  0.08  0.08  0.08  0.08
6⁵     0.3   0.05  0.05  0.12  0.05  0.08  0.05  0.12  0.05  0.05  0.05  0.06  0.08  0.06
6⁵     0.1   0.01  0.01  0.10  0.01  0.06  0.01  0.10  0.01  0.01  0.01  0.02  0.09  0.01
6⁵     0.05  0.00  0.00  0.10  0.00  0.05  0.00  0.10  0.00  0.00  0.00  0.01  0.09  0.00

  1. MAF = minor allele frequency.
  2. fij represents the penetrance for a two-locus genotype where the number of disease alleles at the first locus is i and at the second locus is j.
  3. Marginal penetrance defined by fi. = ∑ pj fij, where pj is the frequency of genotype j.
  4. K = disease prevalence.
  5. Only marginal penetrances are displayed; here fij represents the marginal penetrance of the ij. multi-locus genotype.
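As a worked check of the marginal penetrance definition fi. = ∑ pj fij, the sketch below recomputes the marginal penetrances and prevalence for two rows of Table 1 at an allele frequency of 0.5. It assumes, as in the simulations, Hardy Weinberg proportions, linkage equilibrium and equal allele frequencies at both loci; the function names are hypothetical.

```python
def hwe_frequencies(q):
    """Genotype frequencies for 0, 1 and 2 disease alleles under HWE."""
    p = 1.0 - q
    return [p * p, 2 * p * q, q * q]

def marginal_penetrances(f, q):
    """f[i][j] is the two-locus penetrance; returns f_i. = sum_j p_j * f_ij."""
    freq = hwe_frequencies(q)
    return [sum(pj * fij for pj, fij in zip(freq, row)) for row in f]

def prevalence(f, q):
    """K = sum_i p_i * f_i., the population prevalence."""
    freq = hwe_frequencies(q)
    return sum(pi * fi for pi, fi in zip(freq, marginal_penetrances(f, q)))

# Rows of Table 1 for MAF 0.5, arranged as f[i][j].
mod1 = [[0.06, 0.06, 0.06], [0.06, 0.15, 0.15], [0.06, 0.15, 0.15]]
mod4 = [[0.05, 0.15, 0.05], [0.15, 0.05, 0.15], [0.05, 0.15, 0.05]]

mod1_marginals = marginal_penetrances(mod1, 0.5)  # about 0.06, 0.1275, 0.1275
mod1_k = prevalence(mod1, 0.5)                    # about 0.11
mod4_marginals = marginal_penetrances(mod4, 0.5)  # about 0.10 for every genotype
```

The flat marginal penetrances for Mod4 show why this model displays no single locus main effect at a disease allele frequency of 0.5, whereas Mod1 keeps a clear marginal effect.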

Statistics

Multiple Logistic Regression Analysis

The parametric multiple logistic regression technique requires specification of the statistical model and parameters that we want to evaluate for association with disease. We chose to fit a full logistic regression model for each pair of loci. The loci were coded using two dichotomous dummy variables: one for the heterozygous group and one to indicate the group homozygous for the high-risk allele. The fully fitted logistic regression model therefore contained an intercept, four parameters for the main effects, and four parameters for the interaction terms between the two loci. We applied both the Bonferroni (fixed overall type I error to 0.05) and the FDR method (Simes procedure, FDR 0.05) to deal with the multiple comparisons. Those single loci or interaction parameters that passed the significance criterion were considered our outcomes of interest. The logistic regression analysis was performed in Stata Version 8; we used the ‘multproc’ add-on package written by R. Newson for the Bonferroni and FDR procedure.
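The dummy coding and the two multiple comparison rules can be sketched as follows. This is not the ‘multproc’ code; it assumes that the Simes procedure refers to the usual step-up rule in which the k smallest p-values are rejected, with k the largest rank for which the ordered p-value does not exceed k times alpha divided by the number of tests. The function names and the p-values are hypothetical.

```python
def dummy_code(genotype):
    """Return (heterozygous, homozygous high-risk) indicators for 0/1/2."""
    return (int(genotype == 1), int(genotype == 2))

def bonferroni(p_values, alpha=0.05):
    """Reject every test with p <= alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def simes_fdr(p_values, alpha=0.05):
    """Step-up rule: reject the k smallest p-values, where k is the
    largest rank with p_(k) <= k * alpha / m."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            k_max = rank
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected

# Hypothetical p-values for five model parameters.
p_values = [0.001, 0.012, 0.02, 0.035, 0.5]
by_bonferroni = bonferroni(p_values)
by_fdr = simes_fdr(p_values)
```

On these values the Bonferroni rule keeps one parameter and the step-up rule keeps four, which illustrates why the Bonferroni correction is the more conservative of the two.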

Set-association method

The set-association method as implemented in the Sumstat software available at http://linkage.rockefeller.edu/register was applied. The Chi-square statistic for genotypic association with case-control status (2 by 3 table) was chosen as the test statistic. Furthermore, we selected the interaction option that computes the Chi-squares for all pair-wise interactions, leading to a total number of 55 genotypic input variables for every replicate. We calculated sum statistics that contained at most 6 variables (single loci and/or interactions) and used 20000 permutations to generate empirical p-values for the test statistics. For every replicate the subset of single loci and pair-wise interaction terms with the lowest permuted p-value and largest number of terms was the multi-locus combination output of interest.
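The count of 55 input variables follows from the 10 simulated SNPs plus all of their pair-wise combinations, as the short check below shows. The SNP names are hypothetical labels.

```python
from itertools import combinations

snps = [f"snp{i}" for i in range(1, 11)]  # 2 causal + 8 non-causal SNPs
pairs = list(combinations(snps, 2))       # all pair-wise interactions
n_inputs = len(snps) + len(pairs)         # 10 single loci + 45 pairs
```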

Logic Regression

We performed the analysis using the Logic Regression module that is available as an R package (version 1.3.1; http://bear.fhcrc.org/~ingor/logic/), and selected the logistic regression option. We used 100,000 iterations in the simulated annealing algorithm, and annealing temperatures were set so that the number of acceptances/rejections during the iteration process was optimal. Logic regression can only deal with dichotomous input variables, and therefore the SNPs were recoded into two dichotomous dummy variables, as was done for the logistic regression analysis. We used the ‘select the best model’ option and limited the model size to a maximum of 4 and 6 leaves for the two-locus models, and 8 leaves for the 4-locus model. The outcome of interest was the loci present in the final best model, as selected under our model selection settings.

MDR

We performed the analysis using MDR version 0.6.1, available at http://www.epistasis.org/open-source-mdr-project.html. We evaluated model sizes of 2 and 3 loci, and of 4 loci for the 4-locus genetic heterogeneity models, and counted the number of replicates that contained the causal loci in the final best model that was identified using 10-fold cross-validation. The default values for the parameters that needed to be set were used; the threshold case/control ratio for the definition of high-risk genotypes was set to at least 1:1, tie cells were set to affected, and an exhaustive search method configuration was selected.

Results

Figure 2 shows the number of replicates that contained the two causal loci, either both loci separately or via an interaction, for all the scenarios and methods. For the logistic regression and Sumstat results we could indicate the number of replicates in which the loci were identified via an interaction term; these would therefore have been positive replicates if we had selected on interaction terms only (black shading). For the genetic heterogeneity model, with two two-locus interactions acting to increase risk of disease, we counted the number of replicates in which the four loci were retrieved, either by identifying the simulated interaction terms or the single locus effects (Figure 3). Because of the low penetrance and low frequency of the risk increasing genotypes for Mod2f0.05, Mod3f0.1 and Mod3f0.05, the final simulated case-control datasets contained too few cases (<100) and were therefore dropped.

Figure 2. Number of replicates that contained the two causal loci via both single locus parameters and/or through interaction parameters (black shading) for the 5 two-locus models and the null model. The first to the fourth column represent the models with minor allele frequencies of 0.5, 0.3, 0.1 and 0.05, respectively.

Figure 3. Number of replicates per method that contained the four causal loci in Mod6 (black shading) and ModNull (grey shading) for every MAF.

For the traditional logistic regression technique, without correction for multiple testing, the interacting loci were identified in almost all replicates, even when the causal allele frequency dropped to 0.05. The detection rate was diminished for the four-locus models with a frequency of 0.3 and 0.5. After application of the FDR procedure and Bonferroni method, the number of replicates in which the causal loci were detected was somewhat less in most models, but seriously impaired in the models with an allele frequency of 0.5 and the four-locus models, the Bonferroni correction being more conservative than the FDR method. The loci were found specifically through the interaction terms in Mod4f0.5, Mod5f0.5 and Mod6f0.5, where there were no expected marginal effects. Only relying on the interaction parameters generally resulted in lower detection of causal loci, and failed especially in Mod2. It was also worse for the 0.05 MAF models where the interaction terms were dropped due to sparseness of data. Furthermore, we saw that the number of replicates with significant interaction parameters was lower for the FDR corrected results and even lower for the Bonferroni results, in most scenarios.

The number of false positives for the null model in the uncorrected analysis was highly inflated. There was a high false positive detection rate for the null model with a frequency of 0.1 that could not be mitigated by application of the multiple comparison procedures. A closer look at the results and analysis showed that the sparse number of observations in the genotype cells for this allele frequency led to unreliable parameter estimates and a high rate of extremely low p-values that passed the significance criteria, whereas the parameters of the causal loci were often dropped and not tested in the 0.05 frequency model, due to too few observations in the heterozygote, but especially homozygous, mutant genotype groups. Likewise, the interaction terms were dropped, explaining their absence in the null model findings.

In general, Sumstat performed very well for the 0.3, 0.1 and 0.05 MAF scenarios and not well for Mod1f0.5, Mod2f0.5 and high frequency Mod6. Regarding the last model, in Figure 3 we only counted those replicates as positive if the correct single loci and/or both simulated interaction terms were identified. We saw, however, that in the lower frequency models the four causal loci were identified via interaction terms between one locus from the first two-locus interaction and one locus from the second. If we had counted those replicates as positive results too, the number of replicates in Figure 3 for the 0.1 and 0.05 frequencies would have been 100 and the results for the 0.5 and 0.3 frequencies would have increased to 22 and 32, respectively. For all frequencies the Sumstat method retrieved the causal loci for Mod4 and Mod5 via the interaction terms. In contrast to the logistic regression approach, the causal loci for Mod2 were also detected via the interaction term, especially for the 0.1 allele frequency, and Mod3 was mainly dependent on finding the separate single locus effects. The results for both single causal loci (not shown) were expected to be symmetrical, but for some models they differed substantially, up to a maximum of 19 replicates for Mod1f0.5. We saw that for this genetic model all methods gave asymmetrical results for the two causal loci; the first locus was retrieved more often than the second. The number of falsely detected loci, as observed in the null model, was low relative to the other methods, even without selection on the basis of the empirical sum statistic p-values or the final global p-value. In the ModNull we saw a higher selection of interaction terms over single terms.

We also checked how the results would change when only the replicates with a global p-value of less than 0.05 were considered. The number of false positives in the null models was reduced to zero. It had marginal consequences for the results of Mod1f0.5, Mod2f0.5 and Mod4f0.5, where the number of replicates containing the true loci dropped by at most 7, and had large consequences for Mod3f0.05 and Mod5f0.5/f0.3, where the number of replicates that contained the causal loci was lowered to 16, 49 and 28, respectively. For the four-locus models the numbers in Figure 3 changed for the f0.5 and f0.3 models to 4 and 14, respectively.

The results were different if the sum statistic with the smallest number of loci for the smallest p-values, instead of the largest number, was chosen (data not shown). We then saw that the method was much worse for Mod1, Mod2f0.3 and Mod3 in terms of elevated type I error, because the p-value often reached its minimum of zero after including one of the two causal loci. The method was also not able to identify the four loci in Mod6f0.1 and Mod6f0.05.

The logic regression analysis retrieved a high number of replicates containing the true causal loci, with the exception of the high frequency multiplicative model (Mod1f0.5) and Mod6, where it had difficulties finding all 4 loci. The amount of noise varied; Mod1 and Mod3 in general contained at least 3 terms related to the two causal loci, with the exception of Mod1f0.5 and Mod2 which contained just two terms referring to the causal loci and two referring to non-causal loci. The lower frequency Mod5 often contained 4 causal loci terms, and all Mod4 contained 2 to 4 causal terms out of the four leaves. In all scenarios one tree was constructed, often with the maximum number of leaves allowed. The gain from allowing six leaves instead of four was very limited, and led to a higher inclusion of non-causal loci (results not shown). The smaller the model, that is the lower the number of leaves allowed, the lower the number of false positive findings. We observed a tendency for a lower number of false positive findings for low frequency SNPs, reflected in the decreasing values for ModNull from f0.5 to f0.05.

The high frequency four locus Mod6 contained a high number of non-causal loci and a varying number of true loci. The performance for the low frequency scenarios was remarkably better. The difference between the analysis allowing 6 and 8 leaves was most pronounced for the 0.3 and 0.5 MAFs, where we saw a higher number of correct positive findings but also much higher false positive results for the analysis with 8 leaves.

The same trends in results were observed for the MDR as for the logic regression: it retrieved the loci in all situations, except for the high frequency Mod6, and the performance was slightly impaired in Mod1f0.5. Because the model could only contain two loci, no non-causal loci were included in those models that correctly identified the loci. There was no additional value from models containing three instead of two loci; the retrieved number of causal loci remained similarly high but the number of false positives was substantially higher in the high frequency models. Also here we observed a bias towards high frequency SNPs for the null model.

Discussion

We performed a simulation study to identify strengths and weaknesses of four multi-locus methods employed to identify interacting loci in several case-control scenarios. These methods can be applied in numerous ways. For pragmatic and computational reasons we have not used the methods to their full capacity regarding model selection and model testing options. Therefore we can only draw conclusions about the strengths and weaknesses we encountered in the ways we applied them to the simulated data. Since we did not fix the type I error for all methods in an equivalent way, it is not possible to make fair direct comparisons of power between the methods, and we have focussed on identifying strengths and weaknesses for each method separately.

In the application of the logistic regression the known problems in standard regression techniques of inflated findings of false positives, and diminished power caused by the presence of sparse data and multiple testing problems, were encountered, despite there being only 10 loci in these datasets. We can therefore underscore the importance of the use of a model testing procedure, such as permutation tests, even in studies with few genetic variants. Furthermore, extensions of the traditional regression models, that are especially developed to deal with sparse data, like penalized regression models could be applied. We chose to apply a full model containing parameters for the two loci and their interaction terms. We observed that the significance of the interaction parameters in a fully saturated two-locus logistic regression model in the identification of the causal loci was dependent on the underlying genetic models, the allele frequency, and the multiple comparison correction procedure. The use of interaction parameters especially improved the identification of interacting loci in cases of statistical interaction models that showed no expected marginal effects. North et al. (2005) evaluated how well logistic regression models that included different model parameters, including interaction parameters, fitted the data compared to a fully saturated model, and how well they corresponded to the true underlying penetrance based models for a variety of two-locus disease models. They showed that for some models the inclusion of interaction parameters is advantageous but there is no direct correspondence between the interactive effects in the logistic regression models and the underlying penetrance based models displaying some kind of epistasis effect (North et al. 2005). The latter has been confirmed in our study.

An approach that we have not evaluated here, but that can be applied to identify gene-gene interactions, is the case-only approach. It can be used to efficiently identify deviation from a multiplicative model for the relative risks by testing the association between loci in the cases. This design can be more powerful than the traditional case-control analysis. However, it is restricted to the identification of interactions only; the main effects of loci cannot be estimated or tested. Furthermore, this approach is sensitive to deviations from the underlying assumption that the loci are independent in the general population (Albert et al. 2001); bias will be introduced if linkage disequilibrium exists between the loci of interest in the control population.
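
As a rough illustration of the case-only idea, the sketch below tests the association between carrier status at two loci in cases only, using a Pearson chi-square test on a 2x2 table. The function name, the dichotomous carrier coding and the counts are assumptions made for this example; the test is only interpretable as an interaction test if the two loci are independent in the source population.

```python
import math


def case_only_interaction(n11, n10, n01, n00):
    """Association between two loci among cases only.

    n11: cases carrying the risk variant at both loci
    n10: carriers at locus A only, n01: carriers at locus B only
    n00: carriers at neither locus
    Returns (odds ratio, chi-square statistic with 1 df, p-value).
    """
    odds_ratio = (n11 * n00) / (n10 * n01)
    n = n11 + n10 + n01 + n00
    row1, row0 = n11 + n10, n01 + n00
    col1, col0 = n11 + n01, n10 + n00
    chi2 = n * (n11 * n00 - n10 * n01) ** 2 / (row1 * row0 * col1 * col0)
    # Upper tail probability of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return odds_ratio, chi2, p_value


# Hypothetical counts among 200 cases.
odds_ratio, chi2, p_value = case_only_interaction(60, 40, 40, 60)
print(odds_ratio, chi2, round(p_value, 4))
```

A case-only odds ratio above 1 suggests that the joint effect exceeds the product of the individual relative risks. No control genotypes enter the calculation, which is why main effects cannot be estimated with this design.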

The Sumstat method had difficulty finding the causal loci in the high frequency four-locus models, but for the other scenarios the results were good. We have not performed the set-association method without construction of the two-locus interaction input variables, and it is therefore not possible to discuss the added value of using these in addition to the single locus parameters as a test statistic. We did see that the identification of the loci through the interaction parameter or single loci variables was dependent on the underlying model and the MAF, but the loci were consistently found via the interaction term in the ‘exclusive OR’ and ‘missing lethal genotype’ scenarios. The inability to deal with interacting loci that show no or weak main effects is a frequently mentioned disadvantage of the set association approach (Heidema et al. 2006). Heidema and colleagues state in their review that genetic interactions are only tested for the loci that are incorporated into the sum statistic. In our view, incorporating loci into the sum statistic does not by itself address interactions between these loci, since the separate test statistics are simply added; we saw, however, that this can be overcome by introducing the interaction test statistics prior to calculating the sum statistics. The current Sumstat version is not equipped to handle high-order interactions; only two-locus interaction parameters can be constructed. This was not a problem in our simulations, but could be a shortcoming in applications to real data. Selection of the results based on the permutation global p-values reduced the type I error perfectly. However, this was at the expense of a large loss of power to retrieve the causal loci for some models.
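
The general sum-statistic idea with a permutation-based global p-value can be sketched as follows. This is a simplified illustration and not the Sumstat implementation: the trend-type statistic (N times the squared correlation), the product coding of the interaction variable, the function names and the toy data are all assumptions made here.

```python
import random


def trend_stat(x, y):
    """N * r^2 between a variable x and 0/1 status y (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else n * sxy * sxy / (sxx * syy)


def top_k_sums(columns, y, max_k):
    """Sums of the 1, 2, ..., max_k largest per-variable statistics."""
    stats = sorted((trend_stat(c, y) for c in columns), reverse=True)[:max_k]
    sums, total = [], 0.0
    for s in stats:
        total += s
        sums.append(total)
    return sums


def sum_statistic_test(columns, y, max_k=3, n_perm=200, seed=1):
    """Return (best number of variables, its p-value, global p-value)."""
    rng = random.Random(seed)
    pool = [top_k_sums(columns, y, max_k)]  # element 0 holds the observed data
    for _ in range(n_perm):
        shuffled = list(y)
        rng.shuffle(shuffled)  # permute case-control labels
        pool.append(top_k_sums(columns, shuffled, max_k))

    def p_values(sums):
        # Rank of each sum within the pool of observed and permuted sums.
        return [sum(q[k] >= sums[k] for q in pool) / len(pool)
                for k in range(len(sums))]

    min_ps = [min(p_values(s)) for s in pool]
    obs_p = p_values(pool[0])
    best_k = obs_p.index(min(obs_p)) + 1
    # Second level: how often is the smallest p-value of a data set in the
    # pool at least as small as the observed smallest p-value?
    global_p = sum(m <= min_ps[0] for m in min_ps) / len(pool)
    return best_k, min_ps[0], global_p


# Toy data: 100 cases followed by 100 controls (hypothetical).
status = [1] * 100 + [0] * 100
locus_a = [1 - s if i % 10 == 0 else s for i, s in enumerate(status)]  # associated
locus_b = [i % 2 for i in range(200)]                                   # noise
locus_c = [i % 3 for i in range(200)]                                   # noise
a_times_b = [a * b for a, b in zip(locus_a, locus_b)]  # interaction variable added as input
best_k, min_p, global_p = sum_statistic_test(
    [locus_a, locus_b, locus_c, a_times_b], status)
print(best_k, min_p, global_p)
```

The interaction variable enters the procedure as one more input column, so its statistic competes with the single-locus statistics for inclusion in the sum. The global p-value corrects for having searched over several values of k, which is the step that controls the type I error at the cost of some power.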

The results for the logic regression and MDR were similar. They performed well for all models, with the exception of the high frequency heterogeneity models with two different two-locus interactions, and showed a slightly impaired performance for the high frequency models with relatively small effect sizes. One disadvantage of our procedure is that we did not apply model selection techniques other than limiting the number of genetic parameters in the analysis. This meant that we could not evaluate the underestimation of the type II error and overestimation of the type I error in our study that was caused by not selecting the optimal model size and not using global p-value cut-off points to select the outcome of interest.

One often-highlighted strength of the MDR is its high power to identify high-order interactions for loci without a main effect (Heidema et al. 2006). Our results confirm this for two-locus interactions. We also saw that this characteristic was not limited to the MDR method, as the logic regression and Sumstat were capable of identifying these loci too. We have, however, not compared their power for a fixed type I error.
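
The core step of the MDR, pooling multi-locus genotype cells into high-risk and low-risk groups by their case-control ratio, can be sketched as below. This is a minimal illustration under assumptions made here: cross-validation and permutation testing, which the MDR software uses for model selection, are omitted; genotypes are reduced to carrier status; and the function names and toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def mdr_error(genotypes, status, loci, threshold=1.0):
    """Misclassification rate of the high/low-risk rule for one set of loci."""
    cases, controls = Counter(), Counter()
    for g, s in zip(genotypes, status):
        cell = tuple(g[j] for j in loci)  # multi-locus genotype combination
        if s == 1:
            cases[cell] += 1
        else:
            controls[cell] += 1
    errors = 0
    for cell in set(cases) | set(controls):
        # A cell is high risk if its case-control ratio exceeds the threshold.
        high_risk = cases[cell] > threshold * controls[cell]
        errors += controls[cell] if high_risk else cases[cell]
    return errors / len(status)


def best_model(genotypes, status, size):
    """Locus combination of the given size with the lowest misclassification rate."""
    n_loci = len(genotypes[0])
    return min(combinations(range(n_loci), size),
               key=lambda loci: mdr_error(genotypes, status, loci))


# Toy data: three loci coded as carrier status; disease follows an
# 'exclusive OR' pattern of loci 0 and 1, and locus 2 is unrelated.
genotypes = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)
             for _ in range(10)]
status = [a ^ b for a, b, _ in genotypes]

print(mdr_error(genotypes, status, (0,)))    # single locus: no better than chance
print(mdr_error(genotypes, status, (0, 1)))  # interacting pair: perfect separation
print(best_model(genotypes, status, 2))
```

In this noise-free toy example neither locus shows a marginal effect, yet the two-locus combination separates cases from controls completely. Real data would require cross-validation to guard against the overfitting that comes with evaluating many cells on few individuals.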

The number of replicates containing the loci under the null model decreased when applying a simple model size limiting strategy, namely setting a maximum on the number of leaves for the logic regression and on the number of loci for the MDR. Limiting the model size to the smallest possible size considering the number of true causal loci was advantageous compared to a larger model. However, restraining the model size to a number lower than the actual number of causal SNPs present in the data obviously impaired the performance of the methods. The publicly available MDR software is currently limited to a maximum of 15 loci. When a larger number of loci might be involved in the genetic aetiology this limit would be too small.

There was a tendency in both methods to show a lower number of positive replicates under the null model when the frequency of the SNPs of interest decreased. This bias towards high frequency SNPs might warrant follow-up study to investigate more carefully the causes and consequences of this preference.

To judge if the loci included in the multi-locus combination act additively or not, the user of the logic regression and MDR method can evaluate the number of trees and the way the loci are included in the trees, and the graphically displayed case-control ratios for all multi-locus genotype combinations, respectively. This can become a daunting task, especially for the MDR method, when the number of loci included increases, and also when the number of datasets is large as in our case. We therefore did not explore this option in the current study, but we did find that for the logic regression all loci were consistently combined in one tree, pointing towards non-additive association between the loci and the case-control status in the logistic regression.

Finding similar results for the MDR method and the logic regression was not surprising. The similarities in methodology between the MDR and an extension of a recursive partitioning technique related to the logic regression approach were discussed by Bastone et al. (2004); the MDR method can be viewed as a special form of recursive partitioning technique (Bastone et al. 2004). One major distinction between the MDR and logic regression is that in the first approach tree growth is restricted to a single split in one tree. Based on this fact one might expect the logic regression to perform better under genetic heterogeneity because of its ability to create multiple splits and more than one tree. Also Heidema et al. (2006) state that recursive partitioning techniques, of which logic regression is one example, should be able to detect genetic heterogeneity. This was not confirmed in the results of this study, where the logic regression had difficulty identifying the causal loci for some four-locus models. The two-locus heterogeneity based model was tackled well by the logic regression as well as by the MDR method. Ritchie et al. (2003) identified the impaired performance under genetic heterogeneity of the MDR method. In their simulations the power of the MDR for a model comparable to our four-locus model was worse, probably reflecting the selection on the statistical significance of the output they used. The authors propose the use of cluster analysis or recursive partitioning techniques, to identify clusters of individuals with similar genetic backgrounds prior to performing the MDR to mitigate the low power in genetic heterogeneity, and state that more research into these methods is needed.

We applied the methods to a large number of different statistical interaction models for a variety of allele frequency possibilities, but our simulations were not exhaustive. The high frequency models with two two-locus interactions turned out to be the most challenging. They were not only characterized by interacting risk factors and genetic heterogeneity, but also by small expected marginal effects. So even though the causal loci were retrievable in the individual situations of a statistical heterogeneity model or an underlying penetrance model with small marginal effects, power was diminished in the combined situation. Furthermore, we did not set out to explore the limits regarding model complexity, effect size, allele frequency, sample size, and their combinations for these methods. The authors of the methods have touched upon these aspects. Future research could explore the performance of the methods for more complex underlying models with several one-locus effects and high-order interactions, gene-environment interactions and covariates.

We have simulated data for a small number of SNPs in order to evaluate the performance of multi-locus methods regarding the identification of interacting loci. With the increasing application of large candidate gene arrays and genome-wide SNP arrays it will be of importance to evaluate the strengths and weaknesses of multi-locus methods for handling large amounts of genetic data where linkage disequilibrium is likely to be present. The specific relative strengths and weaknesses in handling a large amount of SNP data for a diversity of methods, including logistic regression, the set association approach and the MDR, have been recently discussed by Heidema et al. (2006).

Ideally, one would limit the analysis of genetic association data to one or a few methods and apply them in a way that would enable capture of all the underlying signals from associated genes, whether acting singly or interactively, and give insight into the underlying complex aetiology of the disease or trait of interest. Since every method has its own strengths and weaknesses, and within every method a diversity of approaches can exist, a multi-analytic approach could help in distinguishing between true and false positive findings. It is, however, important to apply the methods in an optimal way, and to understand the limits of each method in order to correctly interpret the results. The results of this study can give researchers insights into how best to apply the discussed methods in practice, judge where they perform similarly, and help in interpreting the results of the different methods.

Acknowledgements

This study was supported by the Netherlands Heart Foundation, Grant 2002B68. Sita Vermeulen has received a Frye Stipend and was supported by a travel grant from the Stichting Simonsfonds. Martin den Heijer is supported by a VENI grant from the Netherlands Organisation for Scientific Research (NWO). Jo Knight has an MRC Bioinformatics Training Fellowship, Grant G0501329. Pak Sham was supported by National Eye Institute Grant EY-12562, Hong Kong Research Grants Council CERG Grant HKU7669/06M and The University of Hong Kong Strategic Research Theme on Genomics, Proteomics and Bioinformatics. We would like to thank S. Newman for performing the Sumstat analysis.

References

  • Albert, P. S., Ratnasinghe, D., Tangrea, J. & Wacholder, S. (2001) Limitations of the case-only design for identifying gene-environment interactions. Am J Epid 154, 687–693.
  • Bastone, L., Reilly, M., Rader, D. J. & Foulkes, A. S. (2004) MDR and PRP: a comparison of methods for high-order genotype-phenotype associations. Hum Hered 58, 82–92.
  • Brassat, D., Motsinger, A. A., Caillier, S. J., Erlich, H. A., Walker, K., Steiner, L. L., Cree, B. A., Barcellos, L. F., Pericak-Vance, M. A., Schmidt, S., Gregory, S., Hauser, S. L., Haines, J. L., Oksenberg, J. R. & Ritchie, M. D. (2006) Multifactor dimensionality reduction reveals gene-gene interactions associated with multiple sclerosis susceptibility in African Americans. Genes Immun 7, 310–315.
  • Cho, Y. M., Ritchie, M. D., Moore, J. H., Park, J. Y., Lee, K. U., Shin, H. D., Lee, H. K. & Park, K. S. (2004) Multifactor-dimensionality reduction shows a two-locus interaction associated with Type 2 diabetes mellitus. Diabetologia 47, 549–554.
  • Coffey, C. S., Hebert, P. R., Ritchie, M. D., Krumholz, H. M., Gaziano, J. M., Ridker, P. M., Brown, N. J., Vaughan, D. E. & Moore, J. H. (2004) An application of conditional logistic regression and multifactor dimensionality reduction for detecting gene-gene interactions on risk of myocardial infarction: the importance of model validation. BMC Bioinformatics 5, 49.
  • Cordell, H. J. (2002) Epistasis: what it means, what it doesn't mean, and statistical methods to detect it in humans. Hum Mol Genet 11, 2463–2468.
  • Cordell, H. J., Todd, J. A., Hill, N. J., Lord, C. J., Lyons, P. A., Peterson, L. B., Wicker, L. S. & Clayton, D. G. (2001) Statistical modeling of interlocus interactions in a complex disease: rejection of the multiplicative model of epistasis in type 1 diabetes. Genetics 158, 357–367.
  • Culverhouse, R., Suarez, B. K., Lin, J. & Reich, T. (2002) A perspective on epistasis: limits of models displaying no main effect. Am J Hum Genet 70, 461–471.
  • De Quervain, D. J., Poirier, R., Wollmer, M. A., Grimaldi, L. M., Tsolaki, M., Streffer, J. R., Hock, C., Nitsch, R. M., Mohajeri, M. H. & Papassotiropoulos, A. (2004) Glucocorticoid-related genetic susceptibility for Alzheimer's disease. Hum Mol Genet 13, 47–52.
  • Hahn, L. W., Ritchie, M. D. & Moore, J. H. (2003) Multifactor dimensionality reduction software for detecting gene-gene and gene-environment interactions. Bioinformatics 19, 376–382.
  • Heidema, G. A., Boer, J. M., Nagelkerke, N., Mariman, E. C., Van Der A, D. L. & Feskens, E. J. (2006) The challenge for genetic epidemiologists: how to analyze large numbers of SNPs in relation to complex diseases. BMC Genet 7, 23.
  • Hoh, J. & Ott, J. (2003) Mathematical multi-locus approaches to localizing complex human trait genes. Nat Rev Genet 4, 701–709.
  • Hoh, J., Wille, A. & Ott, J. (2001) Trimming, weighting, and grouping SNPs in human case-control association studies. Genome Res 11, 2115–2119.
  • Kim, S., Zhang, K. & Sun, F. (2003) Detecting susceptibility genes in case-control studies using set association. BMC Genet 4(Suppl 1), S9.
  • Kooperberg, C. & Ruczinski, I. (2005) Identifying interacting SNPs using Monte Carlo logic regression. Genet Epidemiol 28, 157–170.
  • Kooperberg, C., Ruczinski, I., LeBlanc, M. L. & Hsu, L. (2001) Sequence analysis using logic regression. Genet Epidemiol 21(Suppl 1), S626–S631.
  • Li, W. & Reich, J. (2000) A complete enumeration and classification of two-locus disease models. Hum Hered 50, 334–349.
  • Ma, D. Q., Whitehead, P. L., Menold, M. M., Martin, E. R., Ashley-Koch, A. E., Mei, H., Ritchie, M. D., DeLong, G. R., Abramson, R. K., Wright, H. H., Cuccaro, M. L., Hussman, J. P., Gilbert, J. R. & Pericak-Vance, M. A. (2005) Identification of significant association and gene-gene interaction of GABA receptor subunit genes in autism. Am J Hum Genet 77, 377–388.
  • Maitland-van der Zee, A. H., Turner, S. T., Schwartz, G. L., Chapman, A. B., Klungel, O. H. & Boerwinkle, E. (2005) A multilocus approach to the antihypertensive pharmacogenetics of hydrochlorothiazide. Pharmacogenet Genomics 15, 287–293.
  • Marchini, J., Donnelly, P. & Cardon, L. R. (2005) Genome-wide strategies for detecting multiple loci that influence complex diseases. Nat Genet 37, 413–417.
  • North, B. V., Curtis, D. & Sham, P. C. (2005) Application of logistic regression to case-control association studies involving two causative loci. Hum Hered 59, 79–87.
  • Ritchie, M. D., Hahn, L. W. & Moore, J. H. (2003) Power of multifactor dimensionality reduction for detecting gene-gene interactions in the presence of genotyping error, missing data, phenocopy, and genetic heterogeneity. Genet Epidemiol 24, 150–157.
  • Ritchie, M. D., Hahn, L. W., Roodi, N., Bailey, L. R., Dupont, W. D., Parl, F. F. & Moore, J. H. (2001) Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am J Hum Genet 69, 138–147.
  • Ruczinski, I., Kooperberg, C. & LeBlanc, L. (2004) Exploring interactions in high-dimensional genomic data: an overview of Logic Regression, with applications. J Multiv Anal 90, 178–195.
  • Thornton-Wells, T. A., Moore, J. H. & Haines, J. L. (2004) Genetics, statistics and human disease: analytical retooling for complexity. Trends Genet 20, 640–647.
  • Wille, A., Hoh, J. & Ott, J. (2003) Sum statistics for the joint detection of multiple disease loci in case-control association studies with SNP markers. Genet Epidemiol 25, 350–359.
  • Williams, S. M., Ritchie, M. D., Phillips III, J. A., Dawson, E., Prince, M., Dzhura, E., Willis, A., Semenya, A., Summar, M., White, B. C., Addy, J. H., Kpodonu, J., Wong, L. J., Felder, R. A., Jose, P. A. & Moore, J. H. (2004) Multilocus analysis of hypertension: a hierarchical approach. Hum Hered 57, 28–38.