### Summary

Permutation tests are widely used in genomic research as a straightforward way to obtain reliable statistical inference without making strong distributional assumptions. However, in this paper we show that in genetic association studies it is typically not possible to construct exact permutation tests of gene-gene or gene-environment interaction hypotheses. We describe an alternative to the permutation approach for testing interaction: a parametric bootstrap approach. Using simulations, we compare the finite-sample properties of several commonly used permutation tests and the parametric bootstrap. We consider interactions of an exposure with single and multiple polymorphisms. Finally, we address when permutation tests of interaction will be approximately valid in large samples for specific test statistics.

### Introduction

Permutation tests (Ernst, 2004; Higgins, 2004) are very popular in genomic research (Hu et al., 2008; Faulkner et al., 2009; Leak et al., 2009). They are simple to compute where analytic approaches may be intractable, and can be exact where analytic results may be only approximate. Rather than comparing the observed value of a test statistic to its distribution under repeated sampling, a permutation test compares the observed value to the distribution obtained by applying a group of permutations to the data, chosen so that the distribution of the data would be unaffected if the null hypothesis were true (Cox & Hinkley, 1997, chap. 6.2). The main limitation of permutation tests is that they are only applicable when the null hypothesis being tested specifies a suitable group of permutations under which the distribution of the data is unchanged.

The use of permutation methods for testing in the regression model with one main effect (or, more simply, in tests of association between two variables) dates back at least to Fisher's exact test (Fisher, 1935). From data vectors *G* and *Y* we create a new data set either by permuting the entries of *G* to give data (*G**, *Y*) or by permuting the entries of *Y* to give data (*G*, *Y**). The test statistic is evaluated on the new data to give one draw from the permutation distribution, and this procedure is repeated until the permutation distribution is estimated as accurately as desired. The p-value is the proportion of this distribution that is at least as extreme as the observed value of the test statistic. The procedure is the same whether the predictor variable is continuous or categorical (Ernst, 2004).
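The procedure above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes a binary genotype *G*, a continuous outcome *Y*, and a difference in means as the test statistic, and the names `permutation_test` and `mean_difference` are hypothetical. It permutes *Y* (giving data (*G*, *Y**)) and reports a two-sided Monte Carlo p-value with the common add-one correction.

```python
import numpy as np

def permutation_test(g, y, stat, n_perm=999, seed=None):
    """Monte Carlo permutation test of no association between g and y.

    Each replicate permutes the entries of y, giving data (G, Y*), and
    re-evaluates the statistic. The two-sided p-value counts permuted
    statistics at least as extreme as the observed one, plus the observed
    value itself (the usual add-one correction).
    """
    rng = np.random.default_rng(seed)
    observed = stat(g, y)
    perm_stats = np.array([stat(g, rng.permutation(y)) for _ in range(n_perm)])
    return (1 + np.sum(np.abs(perm_stats) >= abs(observed))) / (n_perm + 1)

def mean_difference(g, y):
    # Hypothetical statistic: difference in mean outcome between carriers
    # (g == 1) and non-carriers (g == 0) of a binary genotype.
    return y[g == 1].mean() - y[g == 0].mean()

# Simulated example with a genuine association between G and Y.
rng = np.random.default_rng(1)
g = rng.integers(0, 2, size=200)
y = 0.8 * g + rng.normal(size=200)
p = permutation_test(g, y, mean_difference, seed=2)
print(p)
```

Permuting *G* instead of *Y* gives the same permutation distribution here, since either choice breaks the pairing between the two vectors.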

When there are two predictors, *G* and *Z*, permutation testing becomes more complicated (Anderson & Robinson, 2001). The joint null hypothesis that both main effects are zero can be tested by permuting the outcome *Y*, leaving *G* and *Z* unchanged, and using the data sets (*G*, *Z*, *Y**) to compute the permutation distribution of a test statistic. However, an exact test that one specific main effect is zero, i.e., a test of a partial regression coefficient, typically does not exist, because it would require the true value of the other main effect to be known. Anderson & Robinson (2001) compare four approximate permutation tests for partial regression coefficients in models with two main effects, highlighting the method of Freedman & Lane (1983). They note that the exact test for both main effects is typically not even approximately valid for testing one main effect. One special case in which an exact test for the main effect of *G* is available is when *Z* is categorical, with several replicates at each of its levels. In this case, permutations of *Y* or *G* can be done within the groups defined by *Z*. In genetic applications, a binary covariate *Z*, such as treatment, or the categorical genotype at a single nucleotide polymorphism can be used in this way.
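The within-group permutation for a categorical *Z* can be sketched as follows. This is a minimal illustration under stated assumptions: *G* is binary, *Z* has three levels with both genotype groups present in each, and the statistic (the *G* mean difference within each level of *Z*, summed over levels) is a hypothetical choice; the function names are ours, not from the source.

```python
import numpy as np

def permute_within_strata(y, z, rng):
    """Return a copy of y whose entries are shuffled only within levels of z."""
    y_star = y.copy()
    for level in np.unique(z):
        idx = np.flatnonzero(z == level)
        y_star[idx] = rng.permutation(y[idx])
    return y_star

def stratified_stat(g, z, y):
    # Hypothetical statistic: G-group difference in mean outcome within each
    # level of Z, summed over the levels of Z.
    return sum(
        y[(z == lv) & (g == 1)].mean() - y[(z == lv) & (g == 0)].mean()
        for lv in np.unique(z)
    )

def stratified_permutation_test(g, z, y, n_perm=999, seed=0):
    """Permutation test of no G effect given categorical Z (add-one p-value)."""
    rng = np.random.default_rng(seed)
    observed = stratified_stat(g, z, y)
    perm = np.array([stratified_stat(g, z, permute_within_strata(y, z, rng))
                     for _ in range(n_perm)])
    return (1 + np.sum(np.abs(perm) >= abs(observed))) / (n_perm + 1)

rng = np.random.default_rng(3)
n = 300
g = rng.integers(0, 2, n)                        # binary genotype
z = rng.integers(0, 3, n)                        # categorical covariate, 3 levels
y_null = 2.0 * z + rng.normal(size=n)            # Z effect only
y_alt = 2.0 * z + 0.8 * g + rng.normal(size=n)   # Z effect plus a G effect
p_null = stratified_permutation_test(g, z, y_null)
p_alt = stratified_permutation_test(g, z, y_alt)
print(p_null, p_alt)
```

Because each permutation only reshuffles outcomes among individuals sharing a level of *Z*, the (possibly large) effect of *Z* is preserved in every permuted data set, which is what makes the test exact for the main effect of *G*.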

A summary of permutation testing in regression for a non-statistical audience can be found in Anderson (2001). The article covers models with one and two main effects, and notes that in models with main effects and an interaction term there is, in general, no exact permutation method for testing the interaction term. Even when *G* and *Z* are both categorical, no exact permutation method is available for testing interaction (Anderson, 2001). This is because permuting *Y* within the cells defined by the levels of *G* and *Z* generates new data in which the interaction effect is unchanged, not removed as testing requires: within-cell permutation leaves every cell mean, and hence every interaction contrast, exactly as observed.
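A small numerical check makes the argument concrete. The sketch below assumes binary *G* and *Z*, a simulated true interaction of 1.0, and uses the difference-in-differences of cell means as the interaction estimate; `interaction_contrast` is a hypothetical helper. Permuting *Y* within the four (*G*, *Z*) cells leaves the estimate unchanged up to floating-point rounding.

```python
import numpy as np

def interaction_contrast(g, z, y):
    """Difference-in-differences of cell means for binary G and Z."""
    def cell_mean(a, b):
        return y[(g == a) & (z == b)].mean()
    return (cell_mean(1, 1) - cell_mean(0, 1)) - (cell_mean(1, 0) - cell_mean(0, 0))

rng = np.random.default_rng(0)
n = 400
g = rng.integers(0, 2, n)
z = rng.integers(0, 2, n)
y = 1.0 * g * z + rng.normal(size=n)   # simulated true interaction of 1.0

# Permute Y only within the four (G, Z) cells.
y_star = y.copy()
for a in (0, 1):
    for b in (0, 1):
        idx = np.flatnonzero((g == a) & (z == b))
        y_star[idx] = rng.permutation(y[idx])

print(interaction_contrast(g, z, y), interaction_contrast(g, z, y_star))
```

Every permuted data set therefore reproduces the observed interaction, so the resulting "permutation distribution" is degenerate at the observed value and carries no information about the null hypothesis of no interaction.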

Though well established in the statistical literature on experimental design, this result is not widely known in genetic epidemiology or pharmacogenetics. Permutation-based tests for interaction have in fact been used frequently without any rationale given for their exact or approximate validity (Chase et al., 2005; Andrulionyte et al., 2007; Mei et al., 2007; Rana et al., 2007). In this paper we show that these permutation tests need not be even approximately valid. We describe an alternative, the parametric bootstrap, which can give valid tests with moderate sample sizes and requires computational effort similar to that of a permutation test. Parametric bootstrap techniques have been used correctly in a genetic setting, e.g. by Chen et al. (2007). We will discuss the choice of test statistic and show that a standardised statistic, such as a *z*-score or *p*-value, rather than a difference in means, can improve the accuracy of the parametric bootstrap and bring the Type I error rate closer to the nominal level.
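To show why the parametric bootstrap costs about as much as a permutation test (one model fit per replicate), here is a minimal sketch. It is not necessarily the procedure used in this paper: it assumes a Gaussian linear model with main effects of *G* (additive 0/1/2 coding) and a binary exposure *Z*, simulates outcomes from the fitted null model without the *G* × *Z* term, and uses the standardised *z*-statistic of the interaction coefficient. All function names and effect sizes are hypothetical.

```python
import numpy as np

def interaction_zscore(X, y):
    """OLS fit of y on X; return estimate / SE for the last column of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n, k = X.shape
    sigma2 = resid @ resid / (n - k)
    se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[-1, -1])
    return beta[-1] / se

def parametric_bootstrap_interaction(g, z, y, n_boot=999, seed=0):
    """Parametric bootstrap p-value for a G x Z term in a Gaussian linear model."""
    rng = np.random.default_rng(seed)
    n = len(y)
    X0 = np.column_stack([np.ones(n), g, z])   # null model: main effects only
    X1 = np.column_stack([X0, g * z])          # full model: adds the G x Z term
    stat_obs = interaction_zscore(X1, y)
    # Fit the null model and estimate its residual standard deviation.
    beta0, *_ = np.linalg.lstsq(X0, y, rcond=None)
    mu0 = X0 @ beta0
    sigma0 = np.sqrt(np.sum((y - mu0) ** 2) / (n - X0.shape[1]))
    # Simulate outcomes from the fitted null model (no interaction by
    # construction) and recompute the statistic on each simulated data set.
    stat_boot = np.array([
        interaction_zscore(X1, mu0 + rng.normal(scale=sigma0, size=n))
        for _ in range(n_boot)
    ])
    return (1 + np.sum(np.abs(stat_boot) >= abs(stat_obs))) / (n_boot + 1)

rng = np.random.default_rng(4)
n = 300
g = rng.integers(0, 3, n).astype(float)   # additive genotype coding 0/1/2
z = rng.integers(0, 2, n).astype(float)   # binary exposure
y_null = 0.5 * g + 0.3 * z + rng.normal(size=n)
y_alt = y_null + 0.8 * g * z
p_null = parametric_bootstrap_interaction(g, z, y_null)
p_alt = parametric_bootstrap_interaction(g, z, y_alt)
print(p_null, p_alt)
```

Unlike a permutation, each simulated data set is generated under a model in which the interaction is explicitly zero, while the estimated main effects are retained; the price is that validity now rests on the fitted null model being approximately correct.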

The rest of the paper is organised as follows. In the next section we introduce models with an interaction term, and permutation concepts. We contrast the problem of testing for interaction with the problem of testing for overfitting in a model including interactions, where methods such as logic regression and multifactor dimensionality reduction (MDR) validly use permutation tests. We then describe a parametric bootstrap approach to testing for interaction, and evaluate its performance against two types of permutation test commonly used in interaction testing. Finally, we consider scenarios where permutation tests of interaction will be approximately valid in large samples for specific test statistics. These scenarios include some of the practical applications of permutation tests for interaction in genetic association studies.

### Discussion

The statistical literature shows that exact permutation tests for interactions are not available in most situations. We have described two permutation tests that have been used in practice for interaction testing, and have shown that they are not exact. The resulting departure of the Type I error rate from the nominal level can be substantial. A practical alternative for interaction testing is the parametric bootstrap. In our simulations the parametric bootstrap, while not exact, always outperformed the invalid permutation tests. Since the parametric bootstrap performs better and does not require greater computational effort, it can be recommended for scenarios similar to those in our simulations.

We contrasted two types of test statistics: those based on approximately pivotal quantities (i.e. based on *z*) and those based on non-pivotal quantities (i.e. based on the unstandardised interaction estimate). For the parametric bootstrap these test statistics gave similar performance in our simulations, but with permutation methods the Type I error rate was substantially less accurate when non-pivotal quantities were used. Permutation methods did perform acceptably when the sample size was large and approximately pivotal test statistics were used.
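The distinction between pivotal and non-pivotal statistics can be illustrated by simulation. In this sketch (assumed Gaussian linear model with no interaction, hypothetical effect sizes and helper name, and not the simulation design of this paper), the null spread of the raw interaction estimate scales with the unknown residual standard deviation, whereas the null spread of its *z*-statistic stays close to 1 whatever that standard deviation is.

```python
import numpy as np

def interaction_estimate_and_z(g, z, y):
    """OLS fit with a G x Z term; return (estimate, estimate / SE) for it."""
    n = len(y)
    X = np.column_stack([np.ones(n), g, z, g * z])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - X.shape[1])
    se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[-1, -1])
    return beta[-1], beta[-1] / se

rng = np.random.default_rng(0)
n, reps = 200, 2000
g = rng.integers(0, 3, n)   # additive genotype coding 0/1/2
z = rng.integers(0, 2, n)   # binary exposure
results = {}
for sigma in (0.5, 2.0):
    # Outcomes simulated with main effects only, i.e. under no interaction.
    stats = np.array([
        interaction_estimate_and_z(g, z, 0.5 * g + 0.3 * z + rng.normal(scale=sigma, size=n))
        for _ in range(reps)
    ])
    results[sigma] = (stats[:, 0].std(), stats[:, 1].std())
    print(f"sigma={sigma}: sd(estimate)={results[sigma][0]:.3f}, sd(z)={results[sigma][1]:.3f}")
```

Quadrupling the residual standard deviation roughly quadruples the spread of the raw estimate but leaves the spread of *z* essentially unchanged, which is the sense in which *z* is approximately pivotal.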

It is important to remember that neither the parametric bootstrap nor the permutation tests are exact tests for interaction in small samples. Moreover, in contrast to the hypothesis of no association, the hypothesis of no interaction is intrinsically dependent on the form of the model: effects that combine without interaction on one scale can show interaction on another. Any approach to testing for interaction must therefore be model-based to some extent.
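A worked example of this scale dependence, using hypothetical cell means for binary *G* and *Z* chosen for illustration: the same four means show an interaction on the additive scale but none on the multiplicative (log) scale.

```python
import math

# Hypothetical cell means, keyed by (G, Z).
means = {(0, 0): 1.0, (1, 0): 2.0, (0, 1): 3.0, (1, 1): 6.0}

def interaction(f):
    """Difference-in-differences of f(cell mean): zero means no interaction."""
    return (f(means[1, 1]) - f(means[0, 1])) - (f(means[1, 0]) - f(means[0, 0]))

additive = interaction(lambda m: m)      # (6 - 3) - (2 - 1) = 2
multiplicative = interaction(math.log)   # log(6/3) - log(2/1) = 0
print(additive, multiplicative)
```

Whether "no interaction" holds therefore depends on whether the model is linear in the means or in their logarithms, as in a log-linear or logistic model.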