• Family based association study;
  • gene–gene interaction;
  • epistatic interaction;
  • trio logic regression;
  • logicFS;
  • autism


  1. Top of page
  2. Summary
  3. Introduction
  4. Methods
  5. Results
  6. Discussion
  7. Acknowledgements
  8. References

Ensemble methods (such as Bagging and Random Forests) take advantage of unstable base learners (such as decision trees) to improve predictions, and offer measures of variable importance useful for variable selection. LogicFS has been proposed as such an ensemble learner for case-control studies when interactions of single nucleotide polymorphisms (SNPs) are of particular interest. LogicFS uses bootstrap samples of the data and employs the Boolean trees derived in logic regression as base learners to create ensembles of models that allow for the quantification of the contributions of epistatic interactions to the disease risk. In this article, we propose an extension of logicFS suitable for case-parent trio data, and derive an additional importance measure that is much less influenced by linkage disequilibrium between SNPs than the measure originally used in logicFS. We illustrate the performance of the novel procedure in simulation studies and in a case study of 461 case-parent trios with autistic children.



In association studies concerned with complex diseases, individual SNPs often only exhibit a small effect size. However, it is hypothesized that interactions of several SNPs and possibly gene–environment interactions might more strongly influence the risk of disease (Garte, 2001). Since the number of possible interactions between genetic markers and between genetic and environmental variables is vast, statistical procedures are required that can cope with this high-dimensional search space. Several methods for tackling this task have been proposed, including exhaustive searches based on multiple testing (Marchini et al., 2005; Goodman et al., 2006) and multifactor dimensionality reduction (Ritchie et al., 2001; Hahn et al., 2003; Ritchie et al., 2003), as well as machine learning methods such as Random Forests (Breiman, 2001; Lunetta et al., 2004; Bureau et al., 2005; Chen et al., 2007) and neural networks (Lucek & Ott, 1997; North et al., 2003; Ritchie et al., 2003b; Tomita et al., 2004). Besides multifactor dimensionality reduction, the restricted partition method proposed by Culverhouse et al. (2004) and logic regression introduced by Ruczinski et al. (2003) have been specifically developed for analyzing SNP data. Overviews and discussions of some of these procedures can be found in Heidema et al. (2006), McKinney et al. (2006), and Musani et al. (2007).

Logic regression has performed well in SNP association studies (Kooperberg et al., 2001; Witte & Fijal, 2001; Etzioni et al., 2004; Ruczinski et al., 2004; Andrew et al., 2008; Harth et al., 2008; Justenhoven et al., 2008; Suehiro et al., 2008), but has also been applied in other biomedical research areas such as the identification of regulatory motifs (Keles et al., 2004), HIV studies (Segal et al., 2004), DNA methylation (Feng et al., 2005), and biomarker detection (Vaidya et al., 2008). In addition, several modifications and extensions of logic regression have been proposed. Logic regression has been embedded into a Bayesian framework (Kooperberg & Ruczinski, 2005; Clark et al., 2007), and the simulated annealing algorithm employed in logic regression to search for interactions has been replaced by other probabilistic search methods such as genetic programming (Nunkesser et al., 2007) and evolutionary algorithms (Clark et al., 2005, 2008). Most recently, Li et al. (2010a) adapted logic regression, which was originally developed for population-based association studies, to the analysis of case-parent trio data.

Similar to classification and regression trees (Breiman et al., 1984), the Boolean trees used in logic regression models are unstable, that is, small changes in the data can lead to very different trees. Ensemble methods such as bagging (Breiman, 1996) and Random Forests (Breiman, 2001) take advantage of unstable predictors to improve predictions, and more importantly in the context of SNP association studies, offer measures of variable importance, which can improve variable selection. However, none of these methods enable a direct quantification of the importance of combinations of variables. Based on this rationale, Schwender & Ickstadt (2008) proposed a procedure called logicFS (logic Feature Selection) in which a bagging version of the original logic regression is employed to identify disease-associated SNP interactions. The Boolean trees are used as base learners in this ensemble method, which allows for the quantification of the relevance of the detected SNP interactions (i.e., not only of individual SNPs) by providing an importance measure similar to one of the single variable importance measures determined in Random Forests.

In this manuscript, we adapt logicFS to case-parent trio data to stabilize the search for disease-associated interactions in such family-based data. We note that while much thought and effort has been put into the development of methods for testing candidate SNP interactions in case-parent trios (e.g. Schaid, 1999; Lunetta et al., 2000; Cordell & Clayton, 2002; Culverhouse et al., 2002; Cordell et al., 2004; Baksh et al., 2006, 2007; Kotti et al., 2007), few methods directly searching for higher order SNP interactions in family based data have been proposed (Martin et al., 2006; Li et al., 2010a). Moreover, we introduce a new importance measure that takes linkage disequilibrium (LD) between markers into account. The resulting approach called trioFS is then applied to simulated data, and to case-parent trios with autistic children.



Logic Regression

Logic regression introduced by Ruczinski et al. (2003) is a classification and regression procedure that adaptively searches for Boolean combinations $C_k$, $k = 1, \ldots, K$, of binary covariates (using the Boolean operators AND, OR, NOT) and incorporates these terms into a generalized linear model

  $$g(E[Y]) = \beta_0 + \sum_{k=1}^{K} \beta_k C_k \tag{1}$$

where the choice of the link function g depends on the type of the response Y. If, for example, Y is binary, then g is the logit function, which is typically used for logistic regression models. The logic regression framework also includes many other forms of regression, such as linear models or Cox proportional hazard models.

To find the best logic regression model as described in equation (1), Ruczinski et al. (2003) employ a two-step procedure. First, the best scoring models of different sizes (as determined by the number of covariates used in the Boolean terms) are derived, and then model selection procedures such as permutation tests or cross-validation are used to determine the optimal model size. The search for good scoring models is carried out via simulated annealing, a stochastic search algorithm suitable for global optimization problems, and a tree-representation of the logic expressions Ck. Using a set of moves based on this tree-representation, variables and the AND- and OR-operators can be added to, removed from, or alternated in the logic expressions such that each logic expression can be reached from each other expression in a finite number of moves. In each annealing step, the new model is compared with the current logic regression model by a score function (for example, the deviance if Y is binary). If the newly proposed model is an improvement compared to the current model, it gets accepted. Otherwise, an acceptance probability based on the values of the score function for these two models is computed. This acceptance probability also considers how far the annealing has progressed, that is, it ensures that towards the end of the search newly proposed logic regression models are unlikely to get accepted if they score worse. When logic regression is applied to SNP data, each SNP S is typically split up into two binary variables SD and SR, coding for a dominant and a recessive effect of S, respectively. For a detailed description, see Ruczinski et al. (2003).
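Two ingredients of this search can be illustrated with a short Python sketch: the split of a SNP into a dominant and a recessive indicator, and an acceptance probability that shrinks as the annealing progresses. The coding of genotypes as 0, 1, or 2 copies of the variant allele, the function names, and the Metropolis-type exponential form of the acceptance probability are assumptions made for illustration; the scheme actually used is specified in Ruczinski et al. (2003).

```python
import math

def code_snp(genotype):
    """Split a genotype (0, 1, or 2 variant alleles) into two binary variables."""
    if genotype not in (0, 1, 2):
        raise ValueError("genotype must be 0, 1, or 2")
    dominant = genotype >= 1    # at least one variant allele
    recessive = genotype == 2   # two variant alleles
    return dominant, recessive

def acceptance_probability(current_score, proposed_score, temperature):
    """Metropolis-type rule (assumed form); a lower score, e.g. deviance, is better."""
    if proposed_score <= current_score:
        return 1.0  # improvements are always accepted
    # Worse models are accepted less often as the temperature decreases.
    return math.exp(-(proposed_score - current_score) / temperature)

# A subject carrying one variant allele at S3 and two at S7.
s3_dom, s3_rec = code_snp(1)
s7_dom, s7_rec = code_snp(2)
term = s3_dom and s7_dom  # Boolean term "S3 dominant AND S7 dominant"

# The same worsening of the score early (hot) and late (cold) in the search.
early = acceptance_probability(100.0, 102.0, temperature=10.0)
late = acceptance_probability(100.0, 102.0, temperature=0.1)
```

With these hypothetical scores, the worse model is accepted with high probability early in the search and essentially never towards its end.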

Recently, Li et al. (2010a) introduced an extension of logic regression enabling the analysis of case-parent trio data. As in a genotypic transmission disequilibrium test (Schaid, 1996; Cordell et al., 2004), trio logic regression uses the affected proband as a case, and the other Mendelian children (as derived from the parents' genotypes) as matched pseudo-controls. Since there are $4^m - 1$ matched pseudo-controls for each case when considering m unlinked SNPs (Cordell et al., 2004), Li et al. (2010a) restrict the analysis to the 1:3 matching typically employed when testing individual SNPs. This is achieved by randomly ordering the genotypes of the three Mendelian children at each marker, and concatenating those genotypes to generate three pseudo-controls. When SNPs are in LD, Li et al. (2010a) take the haplotype structure into account, designate a case phase for each trio, and select three random pseudo-controls under that phase scenario. As in logic regression, the genotypes are then described by two binary variables in dominant and recessive coding, and a conditional logistic likelihood is used in the search for the logic regression model that best discriminates cases and pseudo-controls.
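The construction of the 1:3 matched pseudo-controls for unlinked SNPs can be sketched as follows. This is a minimal illustration, not the implementation of Li et al. (2010a): genotypes are assumed to be coded as the number of variant alleles (0, 1, 2), all function names are hypothetical, and the haplotype-based handling of SNPs in LD is not covered.

```python
import random
from itertools import product

# Alleles carried by a parent with 0, 1, or 2 copies of the variant allele.
ALLELES = {0: (0, 0), 1: (0, 1), 2: (1, 1)}

def mendelian_children(mother, father):
    """The four equally likely child genotypes at one SNP."""
    return sorted(a + b for a, b in product(ALLELES[mother], ALLELES[father]))

def pseudo_controls(mother, father, case):
    """The three Mendelian children that remain after removing the case."""
    children = mendelian_children(mother, father)
    if case not in children:
        raise ValueError("case genotype is not compatible with the parents")
    children.remove(case)  # removes exactly one copy of the case genotype
    return children

def matched_pseudo_controls(mothers, fathers, case, rng):
    """Randomly order the pseudo-control genotypes per SNP, then concatenate."""
    per_snp = []
    for m, f, c in zip(mothers, fathers, case):
        genotypes = pseudo_controls(m, f, c)
        rng.shuffle(genotypes)
        per_snp.append(genotypes)
    return [tuple(g[p] for g in per_snp) for p in range(3)]

rng = random.Random(1)
# Two SNPs: both parents heterozygous at the first; 2 x 0 parents at the second.
controls = matched_pseudo_controls([1, 2], [1, 0], [1, 1], rng)
```

Each of the three tuples in `controls` is one pseudo-control with a genotype at both SNPs.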

Logic Feature Selection (logicFS)

In logicFS (Schwender & Ickstadt, 2008), logic regression is applied to several bootstrap samples drawn from the subjects in a case-control study to detect high-order SNP interactions associated with the case-control status. Thus, bagging (Breiman, 1996) with base learner logic regression is used in logicFS to stabilize the search for such interactions. To identify the interactions composing the logic expressions, and hence the logic regression models, each logic expression $C_{kb}$, $b = 1, \ldots, B$, in each of the B models (or the complement of $C_{kb}$, if the respective parameter estimate $\hat{\beta}_{kb}$ is negative) is transformed into a disjunctive normal form, that is, an OR-combination of AND-combinations. Each of these AND-combinations in this disjunctive normal form represents one of the interactions.

Since some of the detected interactions will have a larger effect on the disease risk than others, Schwender & Ickstadt (2008) also propose an importance measure for quantifying the relevance of each identified interaction, which is related to one of the variable importance measures (VIM) used in Random Forests (Breiman, 2001). For each of the interactions comprised by a logic regression model, the value of the importance measure is computed by predicting the case-control status of the out-of-bag observations, that is, the subjects that are not part of the bootstrap sample used to fit this model. This is done for both the original model as it has been found by logic regression, and a reduced model, which is derived by removing the interaction of interest from the original model and refitting the parameters in the model with the reduced logic expressions. For each iteration $b = 1, \ldots, B$, and each interaction $P_j$, $j = 1, \ldots, J$, appearing in at least one of the B logic regression models, this leads to two numbers of correctly classified out-of-bag observations, denoted by $N_b$ and $N_b^{(-j)}$. The importance of $P_j$ is then quantified as

  $$\mathrm{VIM}(P_j) = \frac{1}{B} \sum_{b:\, P_j \in \mathcal{P}_b} \left( N_b - N_b^{(-j)} \right)$$

where $\mathcal{P}_b$ is the set of all interactions comprised by the bth logic regression model.
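As a small worked example of this importance measure, the following Python sketch averages the loss in correctly classified out-of-bag observations over the B iterations. The function name and the counts are hypothetical, and the normalization by B is assumed.

```python
def vim_logicfs(n_full, n_reduced, B):
    """Importance of one interaction P_j.

    n_full:    dict b -> number of correct out-of-bag predictions of model b
    n_reduced: dict b -> the same count after removing P_j and refitting;
               contains only the iterations whose model includes P_j
    B:         total number of iterations (assumed normalizing constant)
    """
    return sum(n_full[b] - n_reduced[b] for b in n_reduced) / B

# Hypothetical counts for B = 3 iterations; P_j occurs in models 1 and 3 only.
n_full = {1: 80, 2: 78, 3: 82}
n_reduced = {1: 70, 3: 77}
importance = vim_logicfs(n_full, n_reduced, B=3)  # (10 + 5) / 3
```

Iterations whose model does not contain the interaction contribute nothing to the sum, so an interaction that is found rarely receives a small importance even if it improves the few models that contain it.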

Logic Feature Selection for Trios (trioFS)

Similar to the analysis of data from case-control studies, applying trio logic regression to several subsets of the case-parent trio data can strengthen the identification of disease-associated SNP interactions. When randomly drawing these subsets, it is necessary to take the matching into account. Thus, we do not sample from the cases and pseudo-controls per se, but sample the case-pseudo-control status within each of the case-parent trios. Furthermore, we decided to use subsampling, that is, to randomly draw a certain percentage of trios (typically, 63.2% of the trios, as this is the percentage of subjects expected to be in a bootstrap sample) instead of using bootstrap sampling (i.e., randomly drawing with replacement). This is foremost for computational efficiency, as subsampling works as well as bagging, but is computationally cheaper (Buehlmann & Yu, 2002).
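The subsampling step can be sketched in a few lines of Python. The function name and the seed handling are illustrative assumptions; the fraction 0.632 approximates $1 - 1/e$, the expected proportion of distinct subjects in a bootstrap sample.

```python
import math
import random

def subsample_trios(n_trios, fraction=0.632, seed=None):
    """Draw trios without replacement; the remaining trios are out-of-bag."""
    rng = random.Random(seed)
    size = round(fraction * n_trios)
    in_bag = sorted(rng.sample(range(n_trios), size))
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n_trios) if i not in chosen]
    return in_bag, out_of_bag

# Whole trios (case plus matched pseudo-controls) are drawn together,
# so the matching is preserved within each subset.
in_bag, out_of_bag = subsample_trios(1000, seed=42)

# Expected fraction of distinct subjects in a bootstrap sample of size n.
n = 1000
bootstrap_fraction = 1 - (1 - 1 / n) ** n
limit = 1 - math.exp(-1)
```

Because indices refer to trios rather than to individual cases or pseudo-controls, the out-of-bag set consists of complete matched sets as well.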

In the computation of an importance measure, it would again be possible to employ the number of correct predictions of the case-(pseudo-)control status. However, since the controls are artificial, and their number is three times as large as the number of cases, we use another statistic to measure the goodness of the fitted model, namely the predictive log-likelihood (Schmid & Hothorn, 2008). Thus, the parameter estimate $\hat{\beta}_{1b}$ for the bth trio logic regression model and the cases and matched pseudo-controls in the set $\mathcal{O}_b$ containing the out-of-bag observations of the bth iteration are employed to compute the predictive log-likelihood

  $$\ell_b = \sum_{i \in \mathcal{O}_b} \log \frac{\exp\left(\hat{\beta}_{1b}\, c_{bi}^{(0)}\right)}{\sum_{p=0}^{3} \exp\left(\hat{\beta}_{1b}\, c_{bi}^{(p)}\right)}$$

where $c_{bi}^{(0)} \in \{0, 1\}$ is the value of the logic expression $C_{1b}$ in the bth model for the case in the ith trio, and $c_{bi}^{(p)}$ ($p = 1, 2, 3$) are the values of $C_{1b}$ for the matched pseudo-controls. This predictive log-likelihood is calculated for all B logic regression models, and the importance of an interaction $P_j$ is calculated by removing this interaction from the models that contain $P_j$ and computing the log-likelihood $\ell_b^{(-j)}$ of the respective reduced model. The importance of $P_j$ is then given by

  $$\mathrm{VIM}_{\mathrm{Trio}}(P_j) = \frac{1}{B} \sum_{b:\, P_j \in \mathcal{P}_b} -2 \left( \ell_b^{(-j)} - \ell_b \right)$$

where $\mathcal{P}_b$ is the set of all interactions comprised by the bth model, and the factor −2 is used to be in accordance with a likelihood ratio test.
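The predictive log-likelihood and the resulting importance can be sketched in Python as follows. The data layout (one tuple per out-of-bag trio holding the value of the logic expression for the case and for the three pseudo-controls), the function names, and the normalization by B are assumptions made for illustration.

```python
import math

def predictive_loglik(beta, oob_trios):
    """Conditional logistic log-likelihood on out-of-bag trios.

    oob_trios: list of (case_value, (pc1, pc2, pc3)) with values in {0, 1},
               the logic expression evaluated for the case and the three
               matched pseudo-controls.
    """
    total = 0.0
    for case, controls in oob_trios:
        denom = math.exp(beta * case) + sum(math.exp(beta * c) for c in controls)
        total += beta * case - math.log(denom)
    return total

def vim_trio(loglik_full, loglik_reduced, B):
    """Average of -2 * (reduced - full) over models containing the interaction."""
    return sum(-2.0 * (loglik_reduced[b] - loglik_full[b])
               for b in loglik_reduced) / B

# One trio in which only the case satisfies the logic expression.
oob = [(1, (0, 0, 0))]
full = predictive_loglik(math.log(3), oob)  # log(3 / (3 + 3))
null = predictive_loglik(0.0, oob)          # log(1 / 4)
importance = vim_trio({1: full}, {1: null}, B=1)
```

With a coefficient of zero every member of a matched set is equally likely to be the case, so each trio contributes log(1/4); this corresponds to the reduced model when the removed interaction was the only term.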

A problem with importance measures such as the ones of Random Forests and logicFS, which is similar to the multicollinearity problem in linear regression, is that the importance can be lowered substantially when the corresponding SNPs are in strong LD (Lunetta et al., 2004; Nicodemus & Malley, 2009). If, for example, an interaction between the SNPs S1, S2, and S3 is disease-associated, and S1 is in strong LD with S4, then the actual interaction will appear in some of the logic regression models, and the interaction of S4, S2, and S3 is contained in other models. Hence, the actual interaction will show a reduced importance. To adjust for LD, we identify for each interaction $P_j$ the logic regression models that contain interactions of the same number of terms as $P_j$, that only differ from $P_j$ by SNPs that are in tight LD with the SNPs in $P_j$, where each of the replaced SNPs in $P_j$ must have exactly one counterpart in the other interaction. If the bth model contains such a neighbor interaction, we replace it by $P_j$, and refit the changed model to estimate $\beta^*_{1b}$. We then remove $P_j$ from the model to compute the predictive log-likelihood $\ell_b^{*(-j)}$ of the reduced model, and the improvement

  $$\mathrm{Imp}^*_b(P_j) = -2 \left( \ell_b^{*(-j)} - \ell_b^{*} \right) \tag{2}$$

of the bth model due to $P_j$, where $\ell_b^{*}$ denotes the predictive log-likelihood of the refitted model. The adjusted importance measure is then given by

  $$\mathrm{VIM}_{\mathrm{LD}}(P_j) = \frac{1}{B} \left[ \sum_{b:\, P_j \in \mathcal{P}_b} \mathrm{Imp}_b(P_j) + \sum_{b:\, P_j \in \mathcal{N}_b} \mathrm{Imp}^*_b(P_j) \right] \tag{3}$$

where $\mathrm{Imp}_b(P_j)$ is the corresponding improvement in a model that contains $P_j$ itself, $\mathcal{P}_b$ is the set of interactions comprised by the bth model, and $\mathcal{N}_b$ is the set containing all neighbor interactions of the interactions composing the bth logic regression model.
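One possible reading of the neighbor criterion is sketched below in Python: two interactions of equal size are neighbors if the SNPs in which they differ can be paired one-to-one such that every pair is in tight LD. The representation of interactions as sets of SNP names, the dictionary of pairwise r² values, the default threshold, and the function name are assumptions; the dominant or recessive coding of the SNPs is ignored here.

```python
from itertools import permutations

def is_neighbor(p, q, r2, threshold=0.7):
    """Check whether interaction q is a neighbor interaction of p.

    p, q: iterables of SNP names
    r2:   dict mapping frozenset({snp_a, snp_b}) to the pairwise r^2 value
    """
    p, q = set(p), set(q)
    if p == q or len(p) != len(q):
        return False
    p_only = sorted(p - q)  # SNPs that would have to be replaced
    q_only = sorted(q - p)  # candidate counterparts in the other interaction

    def in_ld(a, b):
        return r2.get(frozenset((a, b)), 0.0) > threshold

    # Look for a one-to-one pairing in which every pair is in tight LD.
    return any(all(in_ld(a, b) for a, b in zip(p_only, perm))
               for perm in permutations(q_only))

r2 = {frozenset(("S1", "S4")): 0.95}
same_block = is_neighbor({"S1", "S2", "S3"}, {"S4", "S2", "S3"}, r2)
no_ld = is_neighbor({"S1", "S2", "S3"}, {"S5", "S2", "S3"}, r2)
```

In this example the interaction of S4, S2, and S3 qualifies as a neighbor of the interaction of S1, S2, and S3, whereas replacing S1 by an unlinked SNP does not.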

Another problem is that an interaction might be identified by logic regression that consists of the actual disease-associated interaction and one or, on rare occasions, more additional SNPs that only slightly increase the disease risk in the sample. While over-fitting is in general not a problem for the computation of the improvements as they are determined based on the out-of-bag observations, the importance of the actual interaction itself will be decreased nonetheless. This problem can be solved in a similar way as the LD problem, that is, by replacing the extended interactions in the logic regression models, and analogous to equation (2), by calculating the improvements $\mathrm{Imp}^{E}_b(P_j)$ that would have been due to the actual interaction, had it been in the model instead of the extended interaction. The resulting importance measure adjusted for both LD and too large interactions is given by

  $$\mathrm{VIM}_{\mathrm{Adj}}(P_j) = \frac{1}{B} \left[ \sum_{b:\, P_j \in \mathcal{P}_b} \mathrm{Imp}_b(P_j) + \sum_{b:\, P_j \in \mathcal{N}_b} \mathrm{Imp}^*_b(P_j) + \sum_{b:\, P_j \in \mathcal{E}_b} \mathrm{Imp}^{E}_b(P_j) \right] \tag{4}$$

where the first two sums are the terms of the LD-adjusted measure in equation (3), that is, the improvements due to $P_j$ in the models that contain $P_j$ itself or a neighbor interaction of $P_j$, and $\mathcal{E}_b$ is the set containing all interactions comprised in any of the B logic regression models that in interaction with another SNP make up one of the interactions in the bth model. We typically restrict $\mathcal{E}_b$ to interactions containing one additional variable, since we only rarely observe that an interaction intended to be disease-associated is extended by more than one interaction term in our simulation studies. However, the publicly available software also allows for the extension by more than one additional variable in the interaction terms.

To demonstrate the proposed importance measures (3) and (4) in an example, assume that trioFS with B = 5 iterations is applied to case-parent trio data, and that the disease-associated interaction P1 consisting of the SNPs S1, S2, and S3 is found in iterations 1 and 3, whereas interaction P2 composed of S4, S2, and S3 is identified in iterations 2 and 5, as S1 and S4 are in strong LD. In this case, the improvement due to P1 as a term of the fitted model will be zero for b = 2, 4, 5, and the improvement due to P2 as a term of the fitted model will be zero for b = 1, 3, 4, but the improvement credited to P1 as a neighbor interaction will be larger than zero for b = 2, 5, and the improvement credited to P2 as a neighbor interaction will be larger than zero for b = 1, 3. If we, moreover, assume that in the fourth iteration an interaction P3 consisting of P1 and S5 is detected, then the improvement credited to P1 through this extended interaction is larger than zero for b = 4, but the corresponding improvement for P2 is zero, as P3 is not an extension of P2.
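The bookkeeping in this example can be made concrete with a few lines of Python. All improvement values below are invented for illustration, and the three dictionaries (direct, neighbor, and extension contributions) as well as the division by B are assumptions about how the sums might be organized.

```python
B = 5

# Hypothetical improvements per interaction and iteration b.
direct = {"P1": {1: 4.0, 3: 5.0},     # P1 itself is in models 1 and 3
          "P2": {2: 3.0, 5: 4.0},     # P2 itself is in models 2 and 5
          "P3": {4: 6.0}}             # P3 = P1 extended by S5, in model 4
neighbor = {"P1": {2: 3.5, 5: 4.5},   # P1 substituted for its neighbor P2
            "P2": {1: 3.0, 3: 4.0}}   # P2 substituted for its neighbor P1
extension = {"P1": {4: 5.0}}          # P1 substituted for its extension P3

def vim(name, *sources):
    """Sum the improvements from the given sources and average over B."""
    return sum(sum(src.get(name, {}).values()) for src in sources) / B

vim_unadjusted_p1 = vim("P1", direct)                       # 9.0 / 5
vim_ld_p1 = vim("P1", direct, neighbor)                     # 17.0 / 5
vim_adjusted_p1 = vim("P1", direct, neighbor, extension)    # 22.0 / 5
vim_adjusted_p2 = vim("P2", direct, neighbor, extension)    # 14.0 / 5
```

P2 gains from the LD adjustment but not from the extension adjustment, since P3 is not an extension of P2.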

We note that while the main goal of the trioFS procedure is to generate hypotheses, not to carry out hypothesis tests per se, it is possible to define and generate permutation-based p-values for the SNP interactions based on the importance measures. In such a permutation test, the 1:3 matching has to be taken into account by randomly assigning the case status in a trio to one of the four Mendelian children derived from the parents' genotypes. A simple approach for the p-value estimation is to apply trioFS to a sufficiently large number of permutations of the case-pseudo-control status across trios, and compare the values of the importance measures in these permutations with the observed values from the original application. However, such an approach would be computationally challenging even in small data sets. An alternative and much less time-consuming procedure, which typically leads to almost identical p-values (see the supplementary material to Schwender et al., 2010), is to employ the logic expressions found in the original analysis in all applications to the permuted case-pseudo-control status. In each iteration of this procedure, we permute the case-pseudo-control status, apply a conditional logistic regression to each of the B bootstrap samples using the respective logic expression from the original analysis as predictor, and compute the values of the importance measures based on these refitted models and the corresponding out-of-bag observations. The permutation-based p-value of an interaction is then given by the fraction of the importance values determined for this interaction in these iterations that are larger than or equal to the original importance.
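A minimal Python sketch of the two steps that are specific to the permutation test, the within-trio reassignment of the case status and the final p-value, might look as follows. The data layout and function names are hypothetical, and the refitting of the conditional logistic models between these two steps is omitted.

```python
import random

def permute_case_status(trio, rng):
    """Randomly assign the case status to one of the four Mendelian children.

    trio: sequence of four values, the case first, then three pseudo-controls.
    Returns the new case and the list of the three remaining members.
    """
    members = list(trio)
    new_case = members.pop(rng.randrange(4))
    return new_case, members

def permutation_p_value(observed, permuted):
    """Fraction of permuted importances larger than or equal to the observed one."""
    return sum(1 for value in permuted if value >= observed) / len(permuted)

rng = random.Random(0)
case, controls = permute_case_status([1, 0, 0, 1], rng)

# Hypothetical importance values from four permutations.
p_value = permutation_p_value(5.0, [1.0, 2.0, 5.0, 7.0])  # 2 of 4
```

A p-value of zero under this definition means that none of the permuted importances reached the observed value.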

The computation of the p-values can be further accelerated by making use of the alternative representation of the conditional likelihood proposed by Li et al. (2010b). Instead of considering all n trios in the maximization of the log-likelihood separately, one aggregates all trios showing the same value of the logic expression for the case and the same number of pseudo-controls for which the logic expression is true. Since there are only eight such case-pseudo-control combinations, and two of those do not contribute to the log-likelihood, and thus to the maximization procedure, the conditional log-likelihood can be maximized by considering six instead of n components, leading to a substantial reduction in required computing time.
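The equivalence of the per-trio and the aggregated form of the conditional log-likelihood can be checked with a short Python sketch. The data layout and function names are assumptions for illustration and are not taken from Li et al. (2010b).

```python
import math
from collections import Counter

def loglik_per_trio(beta, trios):
    """Conditional log-likelihood summed over all trios separately."""
    total = 0.0
    for case, controls in trios:
        denom = math.exp(beta * case) + sum(math.exp(beta * c) for c in controls)
        total += beta * case - math.log(denom)
    return total

def loglik_aggregated(beta, trios):
    """Same value, computed from counts of the (case value, k) combinations,
    where k is the number of pseudo-controls for which the expression is true."""
    counts = Counter((case, sum(controls)) for case, controls in trios)
    total = 0.0
    for (case, k), n in counts.items():
        denom = math.exp(beta * case) + k * math.exp(beta) + (3 - k)
        total += n * (beta * case - math.log(denom))
    return total

trios = [(1, (0, 0, 0)), (0, (1, 0, 0)), (1, (1, 1, 1)),
         (0, (0, 0, 0)), (1, (1, 0, 0)), (1, (0, 0, 0))]
separate = loglik_per_trio(0.7, trios)
aggregated = loglik_aggregated(0.7, trios)

# The combinations (0, 0) and (1, 3) contribute log(1/4) for any beta.
constant = loglik_aggregated(0.7, [(1, (1, 1, 1)), (0, (0, 0, 0))])
```

Since the two constant combinations do not depend on the coefficient, only the counts of the remaining six combinations are needed when maximizing the likelihood.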



To illustrate the performance of trioFS, we applied the method to data from a simulation study considering different effect and sample sizes, and to a case study of parents with autistic children.


As a first set-up, we simulated 100 data sets each consisting of genotypes for 100 unlinked SNPs, typed in 1000 case-parent trios. In each of these data sets, the SNPs S3 and S7 were simulated such that the chance of being a case was 3 times larger for subjects exhibiting at least one copy of the variant allele at both S3 and S7. Thus, a subject showing the interaction S3DS7D (S3D ∧ S7D) had a 3-fold increase in the chance of being a case, where the symbol ∧ denotes the logic AND-operator. The genotypes of the other 98 SNPs were drawn under the assumption of no association with the outcome.

We applied trioFS with B = 20 iterations to these 100 data sets and computed the unadjusted importance measure, the LD-adjusted measure of equation (3), and the measure of equation (4) that is additionally adjusted for extended interactions. S3DS7D was identified as the (usually by far) most important interaction in all applications of trioFS when considering the measure of equation (4), and in all but one application when using the unadjusted measure to rank the interactions. In fact, S3DS7D is the only interaction that was detected in all applications. In general, S3DS7D was considered very important by all three measures. However, with the exception of one application, the value of the measure of equation (4) was typically substantially larger than the value of the unadjusted measure (Fig. 1). The reason for this is that although S3DS7D exhibits the largest importance, interactions composed of S3DS7D and one other SNP were also frequently identified in some of the iterations of trioFS, reducing the improvement due to S3DS7D as quantified in equation (2) in such an iteration to zero, and thus, decreasing the overall importance of S3DS7D (see Table 1 for an example output of trioFS). Adjusting for LD has no effect on the importance values in this application, since all SNPs were simulated independently of each other. Thus, no pair of SNPs found in interaction with S3DS7D exhibited an r²-value larger than 0.7, which was used as the defining threshold for LD in equation (3) to quantify the LD-adjusted measure (see Table 1).


Figure 1. Scatter plot for the values of the unadjusted importance measure and the measure of equation (4), which is adjusted for extended interactions, for the interaction term S3DS7D, derived from the applications of trioFS to 10 simulated data sets where disease risk is determined by said interaction. The values of the unadjusted measure are always smaller than the corresponding values of the adjusted measure, illustrating the benefit of accounting for potentially over-fitted interactions. Since SNPs were not in linkage disequilibrium, the values of the LD-adjusted measure of equation (3) are the same as the ones of the unadjusted measure, and are omitted from the plot.


Table 1.  Values of inline image and the adjusted importance measures inline image and inline image for the top five interactions found in the application of trioFS to one of the simulated trio data sets that resulted in the value of inline image closest to the median of the inline image values
Interaction    Unadjusted    LD-adjusted (equation (3))    Adjusted for LD and extended interactions (equation (4))

Note: The interaction term S3DS7D that specifies disease risk in this simulation is the most important finding. The other interactions also contain this term, and thus, the variable importance measure of equation (4), which allows for this type of over-fitting, is boosted. Since SNPs were not in linkage disequilibrium in this simulation, the values of the unadjusted and the LD-adjusted importance measures are identical.


To investigate whether trioFS is also able to detect S3DS7D when this interaction has a smaller effect size, we simulated S3 and S7 such that the odds of being a case were 2.5, 2, or 1.5 times larger for subjects showing S3DS7D. For each of these three odds ratio, 100 data sets were generated, each consisting of 1000 case-parent trios typed at S3, S7 and 98 additional independent SNPs intended to have no effect on the disease risk. Additionally, we simulated 100 data sets consisting of 500 trios for each of the four odds ratios 3, 2.5, 2, and 1.5. We then applied trioFS with B= 20 iterations to each of these data sets.

The simulation reveals that sample sizes of 1000 trios or fewer had to be considered insufficient for a study to detect interactions with odds ratios of 1.5 or smaller (Table 2).

Table 2.  Results of the applications of trioFS to the 100 data sets from each of eight simulation scenarios in which the interaction S3DS7D is intended to be disease-associated
Odds ratio       1.5       1.5       2         2         2.5       2.5       3         3
Trios            500       1000      500       1000      500       1000      500       1000
Top 1            0 (0)     0 (0)     44 (20)   97 (72)   99 (75)   100 (96)  97 (91)   100 (99)
Top 5            1 (1)     0 (0)     54 (37)   100 (89)  99 (87)   100 (98)  100 (100) 100 (100)
Ext. Top 1       0 (0)     0 (0)     13 (20)   0 (22)    1 (24)    0 (4)     1 (7)     0 (1)
Ext. Top 5       1 (1)     1 (1)     67 (67)   98 (98)   98 (98)   99 (99)   100 (100) 99 (99)
p ≤ 0.05         0 (0)     0 (0)     0 (0)     67 (60)   59 (64)   100 (99)  99 (96)   100 (100)
p = 0            0 (0)     0 (0)     0 (0)     46 (36)   32 (37)   100 (99)  94 (92)   100 (99)
pExt ≤ 0.05      0 (0)     0 (0)     0 (0)     7 (7)     1 (1)     61 (72)   18 (47)   90 (97)

Note: This table contains the numbers of applications in which this interaction is found, is identified as the most important interaction or amongst the five most important interactions, respectively, the numbers of applications in which an extension of S3DS7D is identified as most important interaction or amongst the five most important interactions, respectively, and the numbers of applications in which S3DS7D shows a Bonferroni corrected p-value p smaller than 0.05 or equal to zero, and an extension exhibits a p-value pExt smaller than 0.05, using the importance measure adjusted for extended interactions (equation (4)) or, in brackets, the measure without this adjustment.

In the scenarios with an odds ratio of 1.5, trioFS detected S3DS7D in only 8 of the 100 applications to the data sets with 500 trios and 33 times in the data sets consisting of 1000 trios, where only in the former applications S3DS7D ranked once among the five interactions with the largest values of either the importance measure adjusted for extended interactions or the measure without this adjustment. Interactions composed of S3DS7D and another SNP were detected in 40 or 83 of the applications, respectively, but they were detected just once amongst the five top-ranking interactions in both simulation scenarios.

In all but one application to the data sets from the simulation scenarios with odds ratios of 2.5 and 3, S3DS7D was detected and ranked first when considering the importance measure adjusted for extended interactions. In a few of the analyses, S3DS7D did not rank first, but typically second or third, when basing this ranking on the measure without this adjustment, where the scenarios with 500 trios performed worse than the scenarios with 1000 trios. Usually, all top five interactions contained S3DS7D (Table 2).

Employing the importance measure adjusted for extended interactions in the applications to the data sets from the simulation scenario with 1000 case-parent trios and an odds ratio of 2 led to the detection of S3DS7D as the top-ranking interaction in 97% of the cases, whereas S3DS7D itself or an extension of it was found as the most important interaction in 72 or 22 of the applications, respectively, if the measure without this adjustment was used. In the six remaining applications, three-way interactions of the SNPs with no main effect showed up as most important, but in these cases at least three of the other four top-ranking interactions were either S3DS7D or an extension of it, that is, a three-way interaction containing S3DS7D. When considering the data sets consisting of 500 case-parent trios, S3DS7D itself was found in 87 of the applications, and in another 12 analyses it was identified in interaction with another SNP. When the ranking is based on the measure without adjustment for extended interactions, both S3DS7D itself and extensions of it ranked first in 20 of the applications, whereas they represented the most important interaction in 44 and 13 of the analyses, respectively, when considering the adjusted measure.

We also computed p-values based on 10,000 permutations of the case-pseudo-control status for all identified interactions and adjusted for multiple comparisons using the Bonferroni correction. The term S3DS7D was identified as significant in virtually all analyses based on 1000 trios, when the effect size was assumed to be 2.5 or larger (Table 2). Frequently, none of the 10,000 permuted importances were actually larger than the observed (un-permuted) importance of S3DS7D. The p-value was smaller than 0.05 in about 60% (measure without adjustment for extended interactions) or 67% (adjusted measure) of the applications, when an odds ratio of 2 was assumed. When analyzing 500 trios, an odds ratio of 3 was necessary to systematically achieve significance. Virtually all interactions with a p-value smaller than 0.05 were either S3DS7D or an extension of it. An exception is S7D, which showed up significant in some of the applications.
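For reference, the Bonferroni correction multiplies each p-value by the number of tests and caps the result at one. In the Python sketch below, the number of tests is taken to be the number of interactions for which a p-value is supplied, which is an assumption; the interaction names and p-values are invented.

```python
def bonferroni(p_values):
    """Bonferroni-adjust a dict of raw p-values (name -> p)."""
    n_tests = len(p_values)
    return {name: min(1.0, p * n_tests) for name, p in p_values.items()}

# Hypothetical raw permutation p-values for three identified interactions.
raw = {"S3D & S7D": 0.0, "S3D & S7D & S12D": 0.01, "S40R": 0.5}
adjusted = bonferroni(raw)
```

A raw permutation p-value of zero remains zero after the correction, which is why the tables report the number of applications with p = 0 separately.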

To evaluate whether trioFS is also able to detect three-way interactions, S3DS5DS7D was simulated such that it exhibits an odds ratio of 3, 2.5, or 2, and 97 SNPs were randomly drawn under the assumption of no association with the outcome. In this way, six sets consisting of 100 data sets were generated, where the data sets in three of these sets contained 500 case-parent trios, and in the other sets 1000 trios. We then applied trioFS with B = 20 iterations to all of these data sets and computed the importance measures with and without adjustment for extended interactions.

Not surprisingly, even larger effect sizes are required to detect the higher order interaction (Table 3). For an odds ratio of 2, neither S3DS5DS7D nor extensions of it were identified. However, in all but one simulation scenario with odds ratios of 2.5 and 3, S3DS5DS7D was detected by trioFS. In almost every application to the data sets with 1000 trios, this interaction ranked first when employing the importance measure adjusted for extended interactions, and it ranked first in most analyses when considering the measure without this adjustment. Only in a few of the applications, the value of the adjusted measure for S3DS5DS7D was substantially larger than the value of the measure without this adjustment, as just a few extensions of S3DS5DS7D appeared in the applications of trioFS. Instead the two-way interactions contained in S3DS5DS7D, that is, S3DS5D, S3DS7D, and S5DS7D, showed up in almost every application of trioFS. In the settings with 500 trios, frequently at least one of these two-way interactions had a higher importance than S3DS5DS7D, whereas the studies with 1000 trios reliably identified the three-way interaction as the most important one. This can also be summarized using the permutation-based p-values for S3DS5DS7D, which were smaller than 0.05 in virtually every application to the 1000 trios, and zero in most of the applications. On a positive note for the smaller studies, in many instances more than just one of the two-way interactions and S3DS5DS7D appeared among the top five interactions with a p-value smaller than 0.05, suggesting that this three-way interaction might be important for the disease risk prediction.

Table 3.  Results of the applications of trioFS to the 100 data sets from each of six simulation scenarios in which the interaction S3DS5DS7D is intended to be disease-associated

Odds ratio        2         2         2.5        2.5       3          3
Trios             500       1000      500        1000      500        1000
Top 1             0 (0)     0 (0)     35 (47)    98 (89)   2 (36)     99 (100)
Top 5             0 (0)     0 (0)     79 (70)    99 (96)   89 (90)    100 (100)
Pruned Top 1      0 (0)     0 (0)     62 (44)    1 (10)    97 (63)    1 (0)
Pruned Top 5      0 (0)     0 (0)     100 (100)  99 (99)   100 (100)  98 (98)
p ≤ 0.05          0 (0)     0 (0)     37 (55)    99 (98)   46 (75)    100 (100)
p = 0             0 (0)     0 (0)     6 (10)     96 (90)   0 (3)      99 (100)
pPruned ≤ 0.05    0 (0)     0 (0)     59 (42)    45 (41)   93 (66)    7 (7)

Note: This table contains the number of applications in which this interaction is found, is identified as the most important interaction or amongst the five most important interactions, respectively, the number of applications in which one of the three two-way interactions contained in S3DS5DS7D is identified as most important interaction or amongst the five most important interactions, respectively, and the numbers of applications in which S3DS5DS7D shows a Bonferroni corrected p-value p smaller than 0.05 or equal to zero, and one of these two-way interactions exhibits a p-value pPruned smaller than 0.05, using the importance measure adjusted for extended interactions (equation (4)) or, in brackets, the measure without this adjustment.

In the final simulation set-up, we investigated the performance of trioFS and the differences between the importance measures when SNPs are in strong LD. We examined two specific settings for this simulation study, but also refer the reader to the autism case study discussed in the following section, which we believe is a particularly nice illustration of the differences between inline image and inline image when SNPs are in strong LD. For each of these settings, one considering an odds ratio of 2.5 for S3DS7D, the other an odds ratio of 3, we simulated 100 data sets consisting of 100 SNPs typed in 1000 trios. This time, we generated two LD-blocks of SNPs, one consisting of S2, S3, and S4, and the other of S6, S7, and S8. The pairwise r²-values within these blocks were larger than 0.99. The remaining 94 SNPs were randomly drawn.
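To give a feeling for what a pairwise r² above 0.99 means, the following Python sketch computes a genotype-based r² (the squared Pearson correlation of variant-allele counts) for a SNP and a near-copy of it, and for two independently drawn SNPs. It is not the simulation procedure used in this study: the helper names, the minor allele frequency of 0.3, the redraw probability of 0.002, and the use of unrelated individuals instead of trios are all assumptions made purely for illustration.

```python
import random


def genotype_r2(x, y):
    """Squared Pearson correlation between two genotype vectors coded 0/1/2."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return float("nan")  # r2 is undefined for a monomorphic SNP
    return sxy ** 2 / (sxx * syy)


def draw_genotype(maf, rng):
    """Number of variant alleles from two independent allele draws."""
    return sum(rng.random() < maf for _ in range(2))


def simulate_snp(n, maf, rng):
    """Independent genotypes for n individuals (hypothetical helper)."""
    return [draw_genotype(maf, rng) for _ in range(n)]


def noisy_copy(snp, maf, redraw_prob, rng):
    """Copy a SNP, redrawing each genotype with a small probability.

    This is a crude stand-in for a second SNP in tight LD with the first.
    """
    return [draw_genotype(maf, rng) if rng.random() < redraw_prob else g
            for g in snp]


rng = random.Random(1)
s3 = simulate_snp(1000, 0.3, rng)        # assumed causal SNP
s2 = noisy_copy(s3, 0.3, 0.002, rng)     # assumed LD partner of s3
s50 = simulate_snp(1000, 0.3, rng)       # assumed independent SNP

r2_block = genotype_r2(s3, s2)      # close to 1 for the near-copy
r2_unlinked = genotype_r2(s3, s50)  # close to 0 for independent SNPs
```

With SNPs this strongly correlated, the binary variables derived from them are nearly interchangeable, which is why interactions built from different members of the same LD-block can replace each other across iterations.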

Usually at least three, and in most applications four, of the top five interactions were composed of two SNPs, one from each of the two LD-blocks containing S3 and S7, with permutation-based p-values typically equal to zero and always less than 0.05 (see Table 4 for an example). The other top five interactions were always three-way interactions consisting of two SNPs from these LD-blocks and another SNP that contributed only slightly to the effect of the interaction. The term S3DS7D is sometimes found as the most important interaction; frequently, however, another interaction consisting of one of S2D, S3D, or S4D and one of S6D, S7D, or S8D ranks first.

Table 4. The five interactions with the largest values for inline image determined in an application of trioFS to one of the simulated data sets in which S3DS7D shows an odds ratio of 3

Interaction | inline image | inline image | inline image
S3DS7D | 3.69 (0.000) | 9.05 (0.00) | 9.05 (0.00)
S2DS7D | 2.63 (0.000) | 6.61 (0.00) | 9.16 (0.00)
S3DS6D | 1.79 (0.000) | 6.57 (0.00) | 9.80 (0.00)
S3DS8D | 0.97 (0.001) | 6.56 (0.00) | 7.84 (0.00)
S4DS6DSC89R | 0.84 (0.043) | 0.92 (0.02) | 0.92 (0.02)

Note: Here, S3 is in strong LD with S2 and S4, and S7 is in strong LD with S6 and S8. The numbers in the brackets are the Bonferroni corrected p-values corresponding to the respective importance measure.

The identification of different two-way interactions consisting of SNPs from the two LD-blocks containing the truly associated SNPs led to a reduced value of the respective inline image. Employing inline image, however, resulted in a substantially increased importance (see Table 4). If three-way interactions composed of one of these two-way interactions and another SNP were also found, the importance of this two-way interaction was further increased when using inline image. For example, several three-way interactions containing either S2DS7D or S3DS6D were identified in the analysis that led to the results presented in Table 4, but no higher-order interaction containing S3DS7D was identified. Thus, the importances of the former interactions, but not of the latter, were increased when using inline image.

Case Study

In this section, we consider 461 autistic children and their parents from 289 families recruited by the Autism Genetic Resource Exchange (AGRE), a collaborative gene bank created by Cure Autism Now (CAN) and the Human Biological Data Exchange (HBD) to advance genetic research in autism spectrum disorders by consolidating large numbers of families into one collection. Genetic biomaterials and clinical data were obtained for families with at least one offspring diagnosed with an Autism Spectrum Disorder based on evaluation by the Autism Diagnostic Interview-Revised (ADI-R) and the Autism Diagnostic Observation Schedule (Geschwind et al., 2001). Cases were included if they had an ADI-R diagnosis of Autism, and data for both parents were available.

Two of the available 331 SNPs were excluded from the analysis since they were almost monomorphic. Further, ten of the 461 trios were removed as more than 2% of the SNPs in each of these trios exhibited Mendelian errors. The haplotype-based procedure proposed by Li et al. (2010a) was used to impute the missing genotypes, and to transform the case-pseudo-control data into a format suitable for trio logic regression. We then applied trioFS with B= 20 iterations to these data.
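The trio-level quality control step described above can be sketched as follows. This Python fragment is only an illustration and not the code used for the analysis: the function names, the coding of genotypes as the number of variant alleles (0, 1, 2, with None for a missing call), and the choice to compute the error rate over fully typed SNPs only are assumptions.

```python
def transmissible_alleles(genotype):
    """Alleles (0 = reference, 1 = variant) a parent with this genotype can pass on."""
    return {0: {0}, 1: {0, 1}, 2: {1}}[genotype]


def is_mendelian_consistent(father, mother, child):
    """True if the child's genotype can arise from the parental genotypes."""
    return any(a + b == child
               for a in transmissible_alleles(father)
               for b in transmissible_alleles(mother))


def mendelian_error_rate(trio):
    """Fraction of fully typed SNPs in a trio that show a Mendelian error.

    trio: list of (father, mother, child) genotype tuples, one per SNP.
    SNPs with any missing call (None) are skipped (an assumption).
    """
    typed = [g for g in trio if None not in g]
    if not typed:
        return 0.0
    errors = sum(not is_mendelian_consistent(f, m, c) for f, m, c in typed)
    return errors / len(typed)


def filter_trios(trios, max_error_rate=0.02):
    """Keep trios in which at most 2% of the SNPs show Mendelian errors."""
    return [t for t in trios if mendelian_error_rate(t) <= max_error_rate]


# Toy example with two trios typed at four SNPs each.
clean_trio = [(0, 1, 1), (2, 2, 2), (1, 1, 0), (0, 2, 1)]
faulty_trio = [(0, 0, 1), (2, 2, 2), (1, 1, 0), (None, 2, 1)]
kept = filter_trios([clean_trio, faulty_trio])
```

In the toy example, the second trio has one inconsistent SNP out of three fully typed SNPs and is therefore dropped, while the first trio is kept.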

The most important interaction detected by trioFS was a three-way interaction of rs11017112, rs7082126, and rs11017128, all showing the homozygous reference genotypes (Table 5). When considering the adjusted importance measure, a three-way interaction of the latter two SNPs and rs11017114 (also represented by the binary variable using dominant coding, that is, the variable indicating at least one variant allele) exhibits the second largest importance. All these SNPs are from the gene Glutaredoxin 3 (GLRX3) on chromosome 10.

Table 5. The five interactions with the largest value for inline image derived from the application of trioFS to the autism data set

Interaction | inline image | inline image | inline image | inline image | 95% CI
SC180DSC185DSC192D | 8.18 | 15.46 | 15.46 | 4.44 | (3.26, 6.05)
SC180DSC185D | 5.12 | 5.12 | 15.01 | 2.84 | (2.10, 3.85)
S183DS185DS192D | 5.06 | 9.94 | 15.17 | 2.58 | (1.88, 3.53)
S183DS185DS192DS193D | 4.95 | 4.95 | 4.95 | 3.43 | (2.45, 4.80)
SC143DSC148D | 3.95 | 3.95 | 4.40 | 2.41 | (1.81, 3.22)

Note: For conciseness, the SNP rs-IDs are abbreviated as follows: S180: rs11017112; S185: rs7082126; S192: rs11017128; S183: rs11017114; S193: rs4751178; S143: rs553822; S148: rs502862.

Since the haplotype-based imputation of Li et al. (2010a) is probabilistic and data sets created by this procedure can differ, we generated ten data sets with this method and applied trioFS to each. All these applications led to the detection of at least one of the three-way interactions mentioned above, with a permutation-based p-value of zero throughout, where sometimes inline image was replaced by inline image, or inline image by inline image (and r² = 0.985 for these two SNPs). Typically, one of these three-way interactions was also identified as the most important one. Other interactions between SNPs from the same gene and/or SNPs from different genes or chromosomes were also picked up occasionally, but usually not in more than one of these data sets, hinting at spurious associations.

Since epistasis is usually defined as interactions between SNPs in different genes and/or genomic regions, we also analyzed a subset of the 329 SNPs that was previously considered elsewhere (Bowers et al., in preparation). Briefly, this subset consists of 138 independent SNPs showing pairwise r²-values smaller than or equal to 0.2, and each glutathione-related gene is represented by the marker that has the largest estimated marginal effect size. In addition, all SNPs with a marginal p-value less than 0.1 were also included in the analysis. As before, we generated 10 case-pseudo-control data sets by applying the procedure of Li et al. (2010a) to the subset of 138 SNPs (the procedure is also applicable for “degenerate” haplotypes of size 1, that is, individual SNPs), and analyzed these data sets with trioFS. Even though we strongly biased the selection of SNPs, the application of trioFS did not reveal interesting interactions.


Discussion

One of the main objectives in SNP association studies is the detection of interacting SNPs that explain some of the variability in the response of interest. In this manuscript, we have adapted a method suitable for finding such interactions in population-based case-control studies to case-parent trio designs. We have further proposed and motivated an adjustment of previous measures of interaction importance that corrects for LD, and potentially over-fitted interactions.

Similar to importance measures from other approaches such as Random Forests, the importance measures proposed here are used to rank the interactions detected by trioFS by their importance for the disease risk, and to assess which of the interactions detected are relevant risk factors. This is an appealing feature. However, it should be considered a hypothesis-generating rather than a hypothesis-testing procedure, and ideally, interactions of interest should be validated on an independent data set if such data are available. This is obviously true for other methods as well, and the task of devising statistical methods that help to characterize interactions after discovery, and to quantify their contribution to the variation in the phenotype, is an active research area in the community (Edwards et al., 2010; Greene et al., 2010; Nicodemus et al., 2010). We also note that for our procedure, it is possible to compute permutation-based p-values for the importances of the detected interactions, but we recommend that these p-values be considered foremost as descriptive statistics.

The value of the proposed importance measure that adjusts for LD is computed for each neighboring interaction individually, although the SNPs forming these interactions are interchangeable. Another idea might be to jointly consider interactions that differ from each other only by SNPs in tight LD, and compute one importance for these interactions between blocks of SNPs. It is an open question whether the proposed importance measures for SNP interactions can also be used for this purpose, or if a more sophisticated measure is required.

Since not all SNPs composing a disease-associated interaction are necessarily of equal importance – some SNPs might be responsible for most of the effect, others might only lead to a marginal improvement in predicting disease risk – it might be beneficial to also develop methods to quantify how much each of the SNPs contributes individually to any particular interaction, and thus to the disease risk. Univariate testing is certainly not the ultimate solution for this problem, as some of the SNPs might not have a main effect at all, and only show an effect when interacting with other SNPs. Employing variable importance measures such as the ones of Random Forests in (trio) logic expression might lead to more reliable results, as such measures take the multivariate data structure into account.

In applications to simulated data, trioFS is almost always able to detect two- and three-way interactions even for small sample sizes when the effect sizes are large. From our experience with logic regression for population-based data and logicFS, we initially expected that trioFS should also be able to detect interactions with odds ratios considerably smaller than 2. However, in our simulations we have frequently seen “spurious” signals, that is, interactions of null SNPs with large (estimated) odds ratios. One reason for this is the trio design, that is, the comparison of the observed proband with the pseudo-controls. If, for example, only a single marker is considered, a trio with parents of the same homozygous genotype does not contribute to the likelihood, as all Mendelian children have the same genotype. Thus, in simulations with small data sets, large odds ratios can easily arise, although their actual significance (if assessed individually by a hypothesis test) would be low.
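The single-marker argument can be made concrete with a short Python sketch. It assumes the standard case-pseudo-control construction (the proband is compared with the three other children the parents could have had) and a conditional logistic likelihood with a dominant coding of the marker; the function names and the coding are illustrative assumptions, not the implementation used in trio logic regression.

```python
import math
from itertools import product


def parental_alleles(genotype):
    """Two alleles (0 = reference, 1 = variant) of a parent with this genotype."""
    return {0: (0, 0), 1: (0, 1), 2: (1, 1)}[genotype]


def mendelian_children(father, mother):
    """Genotypes of the four equally likely children of a parent pair."""
    return [a + b for a, b in product(parental_alleles(father),
                                      parental_alleles(mother))]


def pseudo_controls(father, mother, child):
    """The three Mendelian children other than the observed proband."""
    children = mendelian_children(father, mother)
    children.remove(child)  # raises ValueError if the trio is inconsistent
    return children


def dominant(genotype):
    """Binary variable indicating at least one variant allele."""
    return int(genotype >= 1)


def trio_likelihood(father, mother, child, beta, coding=dominant):
    """Conditional likelihood contribution of one trio for log odds ratio beta."""
    children = mendelian_children(father, mother)
    if child not in children:
        raise ValueError("child genotype is not Mendelian-consistent")
    numerator = math.exp(beta * coding(child))
    denominator = sum(math.exp(beta * coding(g)) for g in children)
    return numerator / denominator


# Both parents homozygous for the variant allele: all four possible children
# are identical, so the contribution is 1/4 whatever the value of beta.
uninformative = [trio_likelihood(2, 2, 2, b) for b in (-1.0, 0.0, 2.0)]

# Two heterozygous parents: the contribution depends on beta.
informative = [trio_likelihood(1, 1, 2, b) for b in (0.0, math.log(3))]
```

Because the contribution of the homozygous trio is constant in beta, such trios carry no information about the odds ratio, so the effective sample size for a given marker or interaction can be much smaller than the number of trios.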

In an analysis of genotype data from children with autism and their parents, trioFS detects two three-way interactions, each composed of SNPs from the same gene, which appears to be associated with autism. After removing SNPs in LD by selecting one representative SNP for each gene, however, trioFS does not identify interactions that give rise to large values of the importance measures. This is not too surprising: interacting markers without main effects would not have entered this analysis, and markers with strong marginal effects might dominate and mask a potential epistatic effect.

In our computing environment, the application of trioFS to the autism data set took about 6.5 h, where each iteration of trioFS, that is, each application of trio logic regression, took about 19 min, and the computation of the importance measures a few seconds. In general, the computation time of trio logic regression depends on the number of trios and the number of iterations used in the underlying stochastic search algorithm (simulated annealing). Choosing an appropriate number of iterations typically requires some trial and error, and needs to reflect the size of the search space, that is, should be a function of the number of markers investigated. We note that the total computing time can be cut substantially, since the applications of trio logic regression to the different subsamples of the data can be parallelized. The updated R package logicFS containing trioFS will provide the appropriate functionality for performing such parallel computations.
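A back-of-the-envelope calculation shows how much parallelization could help. The Python sketch below assumes that all iterations are independent, take the same time, and are distributed evenly over the workers with no scheduling overhead; the function name and the worker counts are hypothetical, and the figures of 20 iterations and 19 min per iteration are taken from the text above.

```python
import math


def estimated_wall_time_minutes(n_iterations, minutes_per_iteration, n_workers=1):
    """Rough wall-clock time when iterations run in rounds of n_workers."""
    rounds = math.ceil(n_iterations / n_workers)
    return rounds * minutes_per_iteration


serial = estimated_wall_time_minutes(20, 19)                       # 380 min
four_workers = estimated_wall_time_minutes(20, 19, n_workers=4)    # 95 min
full_parallel = estimated_wall_time_minutes(20, 19, n_workers=20)  # 19 min
```

The serial estimate of 380 min (about 6.3 h) is in line with the reported total of about 6.5 h, with the remainder attributable to the importance computation and other overhead not modeled here.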

Nonetheless, the analysis of hundreds of thousands of SNPs would require far too many iterations, rendering an application of trio logic regression, and thus of trioFS, to genome-wide association studies impractical. However, we do not believe that the assessment of potential higher-order interactions using hundreds of thousands of SNPs without prioritization is desirable, as the effect sizes required to detect such interactions would have to be unrealistically large. Logic regression was initially developed for candidate SNP studies and can handle up to a few thousand markers, and the same applies to trio logic regression and trioFS. In particular, if parallelization is employed, it might also be possible to analyze tens of thousands of SNPs (as they might, e.g., appear in exome sequencing) with a version of (trio) logic regression adapted to this new situation, as recent first attempts with such a modified logic regression show.

Software for trioFS will be available in an updated release of the R package logicFS. This package is freely available from the webpage of the Bioconductor project (Gentleman et al., 2004).


Acknowledgements

Support was provided by grant SCHW15-08 1/1 of the Deutsche Forschungsgemeinschaft (HS), CDC grant "Centers for Autism and Developmental Disabilities Research and Epidemiology" DD06-003 U10 DD000183 (MDF), and R01 HL090577 from the National Heart, Lung, and Blood Institute (IR). We would also like to acknowledge the families who participated in AGRE. The AGRE collection Principal Investigator is Daniel H. Geschwind (UCLA). The Co-Principal Investigators include Stanley F. Nelson and Rita M. Cantor (UCLA), Christa Lese Martin (Univ. Chicago), T. Conrad Gilliam (Columbia). Co-Investigators include Maricela Alarcon (UCLA), Kenneth Lange (UCLA), Sarah J. Spence (UCLA), David H. Ledbetter (Emory) and Hank Juo (Columbia). Scientific oversight of the AGRE program is provided by a steering committee (Chair: Daniel H. Geschwind; Members: W. Ted Brown, Maja Bucan, Joseph D. Buxbaum, T. Conrad Gilliam, David Greenberg, David H. Ledbetter, Bruce Miller, Stanley F. Nelson, Jonathan Pevsner, Carol Sprouse, Gerard D. Schellenberg and Rudolph Tanzi).


References
  • Andrew, A. S., Karagas, M. R., Nelson, H. H., Guarrera, S., Polidoro, S., Gamberini, S., Sacerdote, C., Moore, J. H., Kelsey, K. T., Demidenko, E., Vineis, P. & Matullo, G. (2008) DNA repair polymorphisms modify bladder cancer risk: A multi-factor analytic strategy. Hum Hered 65, 105–118.
  • Baksh, M. F., Balding, D. J., Vyse, T. J. & Whittaker, J. C. (2006) A likelihood ratio approach to family-based association studies with covariates. Ann Hum Genet 70, 131–139.
  • Baksh, M. F., Balding, D. J., Vyse, T. J. & Whittaker, J. C. (2007) Family-based association analysis with ordered categorical phenotypes, covariates and interactions. Genet Epidemiol 31, 1–8.
  • Breiman, L. (1996) Bagging predictors. Mach Learn 26, 123–140.
  • Breiman, L. (2001) Random forests. Mach Learn 45, 5–32.
  • Breiman, L., Friedman, J. H., Olshen, R. A. & Stone, C. J. (1984) Classification and regression trees. Belmont, CA: Wadsworth.
  • Buehlmann, P. & Yu, B. (2002) Analyzing bagging. Ann Statist 30, 927–961.
  • Bureau, A., Dupuis, J., Falls, K., Lunetta, K. L., Hayward, B., Keith, T. P. & Eerdewegh, P. V. (2005) Identifying SNPs predictive of phenotype using Random Forests. Genet Epidemiol 28, 171–182.
  • Chen, X., Liu, C. T., Zhang, M. & Zhang, H. (2007) A forest-based approach to identifying gene and gene–gene interactions. Proc Natl Acad Sci USA 104, 19199–19203.
  • Clark, T. G., De Iorio, M. & Griffiths, R. C. (2007) Bayesian logistic regression using a perfect phylogeny. Biostatistics 8, 32–52.
  • Clark, T. G., De Iorio, M. & Griffiths, R. C. (2008) An evolutionary algorithm to find associations in dense genetic maps. IEEE Trans Evol Comp 12, 297–306.
  • Clark, T. G., De Iorio, M., Griffiths, R. C. & Farrall, M. (2005) Finding associations in dense genetic maps: A genetic algorithm approach. Hum Hered 60, 97–108.
  • Cordell, H. J., Barratt, B. J. & Clayton, D. G. (2004) Case/pseudocontrol analysis in genetic association studies: A unified framework for detection of genotype and haplotype associations, gene–gene and gene-environment interactions, and parent-of-origin effects. Genet Epidemiol 26, 167–185.
  • Cordell, H. J. & Clayton, D. G. (2002) A unified stepwise regression procedure for evaluating the relative effects of polymorphisms within a gene using case/control or family data: Application to HLA in type 1 diabetes. Am J Hum Genet 70, 124–141.
  • Culverhouse, R., Klein, T. & Shannon, W. (2004) Detecting epistatic interactions contributing to quantitative traits. Genet Epidemiol 27, 141–152.
  • Culverhouse, R., Suarez, B. K., Lin, J. & Reich, T. (2002) A perspective on epistasis: Limits of models displaying no main effect. Am J Hum Genet 70, 461–471.
  • Edwards, T. L., Turner, S. D., Torstenson, E. S., Dudek, S. M., Martin, E. R. & Ritchie, M. D. (2010) A general framework for formal tests of interaction after exhaustive search methods with applications to MDR and MDR-PDT. PLoS One 5, e9363.
  • Etzioni, R., Falcon, S., Gann, P. H., Kooperberg, C. L., Penson, D. F. & Stampfer, M. J. (2004) Prostate-specific antigen and free prostate-specific antigen in the early detection of prostate cancer: Do combination tests improve detection? Cancer Epidemiol Biomarkers Prev 13, 1640–1645.
  • Feng, Q., Balasubramanian, A., Hawes, S. E., Toure, P., Sow, P. S., Dem, A., Dembele, B., Critchlow, C. W., Xi, L., Lu, H., McIntosh, M. W., Young, A. M. & Kiviat, N. B. (2005) Detection of hypermethylated genes in women with and without cervical neoplasia. J Natl Cancer Inst 97, 273–282.
  • Garte, S. (2001) Metabolic susceptibility genes as cancer risk factors: Time for a reassessment? Cancer Epidemiol Biomarkers Prev 10, 1233–1237.
  • Gentleman, R. C., Carey, V. J., Bates, D. M., Bolstad, B. M. D., Dudoit, S., Ellis, B., Gautier, L., Ge, Y., Gentry, J., Hornik, K., Hothorn, T., Huber, W., Iacus, S., Irizarry, R., Leisch, F., Li, C., Maechler, M., Rossini, A. J., Sawitzki, G., Smith, C., Smyth, G., Tierney, L., Yang, J. Y. H. & Zhang, J. (2004) Bioconductor: Open software development for computational biology and bioinformatics. Genome Biol 5, R80.
  • Geschwind, D. H., Sowinski, J., Lord, C., Iversen, P., Shestack, J., Jones, P., Ducat, L., Spence, S. J. & AGRE Steering Committee (2001) The autism genetic resource exchange: A resource for the study of autism and related neuropsychiatric conditions. Am J Hum Genet 69, 463–466.
  • Goodman, J. E., Mechanic, L. E., Luke, B. T., Ambs, S., Chanock, S. & Harris, C. C. (2006) Exploring SNP-SNP interactions and colon cancer risk using polymorphism interaction analysis. Int J Cancer 118, 1790–1797.
  • Greene, C. S., Himmelstein, D. S., Nelson, H. H., Kelsey, K. T., Williams, S. M., Andrew, A. S., Karagas, M. R. & Moore, J. H. (2010) Enabling personal genomics with an explicit test of epistasis. Pac Symp Biocomput, 327–336.
  • Hahn, L. W., Ritchie, M. D. & Moore, J. H. (2003) Multifactor dimensionality reduction software for detecting gene–gene and gene–environment interactions. Bioinformatics 19, 376–382.
  • Harth, V., Schaefer, M., Abel, J., Maintz, L., Neuhaus, T., Besuden, M., Primke, R., Wilkesmann, A., Thier, R., Vetter, H., Ko, Y. D., Bruening, T., Bolt, H. M. & Ickstadt, K. (2008) Head and neck squamous-cell cancer and its association with polymorphic enzymes of xenobiotic metabolism and repair. J Toxicol Environ Health A 71, 887–897.
  • Heidema, G. A., Boer, J. M. A., Nagelkerke, N., Mariman, E. C. M., van de A, D. L. & Feskens, E. J.M. (2006) The challenge for genetic epidemiologists: How to analyze large numbers of SNPs in relation to complex diseases. BMC Genet 7, 23.
  • Justenhoven, C., Hamann, U., Schubert, F., Zapatka, M., Pierl, C. B., Rabstein, S., Selinski, S., Mueller, T., Ickstadt, K., Gilbert, M., Ko, Y. D., Baisch, C., Pesch, B., Harth, V., Bolt, H. M., Vollmert, C., Illig, T., Eils, R., Dippon, J. & Brauch, H. (2008) Breast cancer: A candidate gene approach across the estrogen metabolic pathway. Breast Cancer Res Treat 108, 137–149.
  • Keles, S., van der Laan, M. J. & Vulpe, C. (2004) Regulatory motif finding by logic regression. Bioinformatics 20, 2799–2811.
  • Kooperberg, C. & Ruczinski, I. (2005) Identifying interacting SNPs using Monte Carlo logic regression. Genet Epidemiol 28, 157–170.
  • Kooperberg, C., Ruczinski, I., LeBlanc, M. & Hsu, L. (2001) Sequence analysis using logic regression. Genet Epidemiol 21, 626–631.
  • Kotti, S., Bickeboeller, H. & Clerget-Darpoux, F. (2007) Strategy for detecting susceptibility genes with weak or no marginal effect. Hum Hered 63, 85–92.
  • Li, Q., Fallin, M. D., Louis, T. A., Lasseter, V. K., McGrath, J. A., Avramopoulos, D., Wolyniec, P. S., Valle, D., Liang, K. Y., Pulver, A. E. & Ruczinski, I. (2010a) Detection of SNP–SNP interactions in trios of parents with schizophrenic children. Genet Epidemiol 34, 396–406.
  • Li, Q., Louis, T. A., Fallin, M. D. & Ruczinski, I. (2010b) Detection of SNP–SNP interactions in case-parent trios (in revision).
  • Lucek, P. R. & Ott, J. (1997) Neural network analysis of complex traits. Genet Epidemiol 14, 1101–1106.
  • Lunetta, K. L., Faraone, S. V., Biederman, J. & Laird, N. M. (2000) Family-based tests of association and linkage that use unaffected sibs, covariates, and interactions. Am J Hum Genet 66, 605–614.
  • Lunetta, K. L., Hayward, L. B., Segal, J. & van Eerdewegh, P. (2004) Screening large-scale association study data: Exploiting interactions using random forests. BMC Genet 10, 32.
  • Marchini, J., Donnelly, P. & Cardon, R. C. (2005) Genome-wide strategies for detecting multiple loci that influence complex diseases. Nat Genet 37, 413–416.
  • Martin, E. R., Ritchie, M. D., Hahn, L., Kang, S. & Moore, J. H. (2006) A novel method to identify gene–gene effects in nuclear families: The MDR-PDT. Genet Epidemiol 30, 111–123.
  • McKinney, B. A., Reif, D. M., Ritchie, M. D. & Moore, J. H. (2006) Machine learning for detecting gene–gene interactions: A review. Appl Bioinform 5, 77–88.
  • Musani, S. K., Shriner, D., Liu, N., Feng, R., Coffey, C. S., Yi, N., Tiwari, H. K. & Allison, D. B. (2007) Detection of gene × gene interactions in genome-wide association studies of human population data. Hum Hered 63, 67–84.
  • Nicodemus, K. K. & Malley, J. D. (2009) Predictor correlation impacts machine learning algorithms: Implications for genomic studies. Bioinformatics 25, 1884–1890.
  • Nicodemus, K. K., Callicott, J. H., Higier, R. G., Luna, A., Nixon, D. C., Lipska, B. K., Vakkalanka, R., Giegling, I., Rujescu, D., Clair, D. S., Muglia, P., Shugart, Y. Y. & Weinberger, D. R. (2010) Evidence of statistical epistasis between disc1, cit and ndel1 impacting risk for schizophrenia: Biological validation with functional neuroimaging. Hum Genet 127, 441–452.
  • North, B. V., Curtis, D., Cassell, P. G., Hitman, G. A. & Sham, P. C. (2003) Assessing optimal neural network architecture for identifying disease-associated multi-marker genotypes using a permutation test, and application to calpain 10 polymorphisms associated with diabetes. Ann Hum Genet 67, 348–356.
  • Nunkesser, R., Bernholt, T., Schwender, H., Ickstadt, K. & Wegener, I. (2007) Detecting high-order interactions of single nucleotide polymorphisms using genetic programming. Bioinformatics 23, 3280–3288.
  • Ritchie, M. D., Hahn, L. W. & Moore, J. H. (2003) Power of multifactor dimensionality reduction for detecting gene–gene interactions in the presence of genotyping error, missing data, phenocopy, and genetic heterogeneity. Genet Epidemiol 24, 150–157.
  • Ritchie, M. D., Hahn, L. W., Roodi, N., Bailey, L. R., Dupont, W. D., Parl, F. F. & Moore, J. H. (2001) Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am J Hum Genet 69, 138–147.
  • Ritchie, M. D., White, B. C., Parker, J. S., Hahn, L. W. & Moore, J. H. (2003b) Optimization of neural network architecture using genetic programming improves detection and modeling of gene–gene interactions in studies of human diseases. BMC Bioinformatics 4, 28.
  • Ruczinski, I., Kooperberg, C. & LeBlanc, M. (2003) Logic regression. J Comput Graph Stat 12, 475–511.
  • Ruczinski, I., Kooperberg, C. & LeBlanc, M. (2004) Exploring interactions in high-dimensional genomic data: An overview of logic regression, with applications. J Mult Anal 90, 178–195.
  • Schaid, D. J. (1996) General score tests for associations of genetic markers with disease using cases and their parents. Genet Epidemiol 13, 423–449.
  • Schaid, D. J. (1999) Likelihoods and TDT for the case-parents design. Genet Epidemiol 16, 250–260.
  • Schmid, M. & Hothorn, T. (2008) Flexible boosting of accelerated failure time models. BMC Bioinform 9, 269.
  • Schwender, H. & Ickstadt, K. (2008) Identification of SNP interactions using logic regression. Biostatistics 9, 187–198.
  • Schwender, H., Ruczinski, I. & Ickstadt, K. (2010) Testing SNPs and sets of SNPs for importance in association studies. Biostatistics, doi:10.1093/biostatistics/kxq042.
  • Segal, M. R., Barbour, J. D. & Grant, R. M. (2004) Relating HIV-1 sequence variation to replication capacity via trees and forests. Stat Appl Genet Mol Biol 3, 2.
  • Suehiro, Y., Wong, C. W., Chirieac, L. R., Kondo, Y., Shen, L., Webb, C. R., Chan, Y. W., Chan, A. S. Y., Chan, T. L., Wu, T. T., Rashid, A., Hamanaka, Y., Hinoda, Y., Shannon, R. L., Wang, X., Morris, J., Issa, J. P. J., Yuen, S. T., Leung, S. Y. & Hamilton, S. R. (2008) Epigenetic-genetic interactions in the apc/wnt, ras/raf, and p53 pathways in colorectal carcinoma. Clin Cancer Res 14, 2560–2569.
  • Tomita, Y., Tomida, S., Hasegawa, Y., Suzuki, Y., Shirakawa, T., Kobayashi, T. & Honda, H. (2004) Artificial neural network approach for selection of susceptible single nucleotide polymorphisms and construction of prediction model on childhood allergic asthma. BMC Bioinformatics 5, 120.
  • Vaidya, V. S., Waikar, S. S., Ferguson, M. A., Collings, F. B., Sunderland, K., Gioules, C., Bradwin, G., Matsouaka, R., Betensky, R., Curhan, G. C. & Bonventre, J. V. (2008) Urinary biomarkers for sensitive and specific detection of acute kidney injury in humans. Clin Transl Sci 3, 200–208.
  • Witte, J. S. & Fijal, B. A. (2001) Introduction: Analysis of sequence data and population structure. Genet Epidemiol 21, 600–601.