Keywords:

  • gene × gene joint action;
  • minor allele frequency;
  • Hardy-Weinberg disequilibrium;
  • linkage disequilibrium

ABSTRACT

  1. Top of page
  2. ABSTRACT
  3. Introduction
  4. Method
  5. Results
  6. Discussion
  7. Acknowledgments
  8. References
  9. Appendix

We propose a new approach to detect gene × gene joint action in genome-wide association studies (GWASs) for case-control designs. This approach offers an exhaustive search for all two-way joint action (including, as a special case, single gene action) that is computationally feasible at the genome-wide level and has reasonable statistical power under most genetic models. We found that the presence of any gene × gene joint action may imply differences in three types of genetic components: the minor allele frequencies and the amounts of Hardy-Weinberg disequilibrium may differ between cases and controls, and between the two genetic loci the degree of linkage disequilibrium may differ between cases and controls. Using Fisher's method, it is possible to combine the different sources of genetic information in an overall test for detecting gene × gene joint action. The proposed statistical analysis is efficient and its simplicity makes it applicable to GWASs. In the current study, we applied the proposed approach to a GWAS on schizophrenia and found several potential gene × gene interactions. Our application illustrates the practical advantage of the proposed method.


Introduction


The availability of high-throughput genotyping technology has enabled genome-wide association studies (GWASs), which have provided an unprecedented opportunity to identify genetic variants for many human diseases and traits, including age-related macular degeneration [Klein et al., 2005], body mass index [Church et al., 2010], Crohn's disease [Barrett et al., 2008], and Alzheimer's disease [Harold et al., 2009]. However, although GWASs have successfully identified many susceptibility variants, the findings explain only a fraction of the expected overall heritability, and replication of significant results has often failed. It has been speculated that some of the unexplained heritability and the inconsistency among replication studies may be caused by epistatic effects [Culverhouse et al., 2002]. Simulations, for instance, have shown that in the presence of gene × gene interaction, differences in allele frequency between the initial study and the replication populations can affect the power of single-locus strategies and hamper the reproducibility of true findings [Marchini et al., 2005]. At the same time, increasing empirical evidence suggests that general gene × gene interactions [Zerba et al., 2000] play a key role in the variation of complex traits. Consequently, the systematic statistical analysis of gene × gene interaction is critical for the success of GWASs in genetic epidemiology.

However, in spite of its importance, the meaning of gene × gene interaction has not been clearly understood, and statistical gene × gene interaction has often been confused with biological gene × gene interaction. In particular, inference on the biological mechanism is complicated by the lack of direct correspondence between statistical and biological interaction [Ueki and Cordell, 2012]. Statisticians usually define statistical interaction as a departure from additivity in a linear model on a selected measurement scale. However, as pointed out by Wang et al. [Wang et al., 2011], if one aims to infer biological interactions, statistically modeled interactions and main effect terms should not be interpreted separately. In this context, it has been suggested that "biological interaction" can be detected by a statistical model that compares joint multilocus genotype frequencies [Wang et al., 2010]. In this study we focus on biological interaction, and we refer to it as joint action to avoid confusion.

Numerous parametric and nonparametric approaches have been proposed to detect gene × gene joint action. Parametric approaches are usually based on logistic regression [Cordell, 2002], and shrinkage approaches have been applied to alleviate the multiple testing problem [Park and Hastie, 2008]. Nonparametric approaches, including multifactor dimensionality reduction (MDR) [Ritchie et al., 2001], multivariate adaptive regression splines [Cook et al., 2004], and random forests [Jiang et al., 2009], have also been adapted to detect gene × gene joint action. However, most statistical methods for detecting gene × gene joint action are affected by some intractable issues. First, gene × gene joint action analysis at the genome-wide scale is computationally intensive, and the computational requirements can detract from the strength of each method. MDR has been shown to be a powerful approach for complex gene × gene joint action [Ritchie et al., 2001]; however, the initial marker selection step, which alleviates the computational burden, can reduce the statistical efficiency of this approach. Cordell concluded that the semiexhaustive search of two-locus interactions implemented in PLINK [Purcell et al., 2007] and the random forests analysis implemented in Random Jungle [Cordell, 2009] are computationally feasible approaches. Recent computational advances based on graphics processing units [Chikkagoudar et al., 2011; Yung et al., 2011] or multithreaded parallelization [Gyenesei et al., 2012] also enable genome-wide gene × gene joint action analysis. However, further investigation of efficient algorithms is still necessary. Second, even for a two-locus joint action model, there are 50 different models for gene × gene joint action [Li and Reich, 2000], and an exhaustive search of all these interaction models generates a substantial statistical burden. To maximize statistical efficiency, it may be better to search for significant differences in some parameters obtained by transforming the two-locus genotype frequencies.

In genetic association analysis for multiple SNPs, three potentially statistically independent quantities provide information about the genetic association at unlinked loci [Won and Elston, 2008; Won et al., 2009a]. First, the association of single or multiple SNPs with disease produces different allele frequencies in cases and controls [Sasieni, 1997]. Second, the amount of Hardy-Weinberg disequilibrium (HWD) at single or multiple loci may differ between cases and controls [Nielsen et al., 1998]. Third, the amount of linkage disequilibrium (LD) may differ between cases and controls [Nielsen et al., 2004], where the composite LD can be used if genotypes are not phased [Zaykin et al., 2006]. The differences in these quantities between cases and controls can be utilized to detect gene × gene joint action. In this report, the information about the presence of a gene × gene joint action that is contained in these three quantities is utilized to maximize the statistical power of the analysis while minimizing the numerical complexity. Thus, the proposed approach enables gene × gene joint action analysis on a genome-wide scale via exhaustive analysis of all possible two-way interactions.

Furthermore, we applied our method to a GWAS on schizophrenia (SCZ). SCZ is a common psychiatric disorder that results from interactions between genetic and environmental factors [Sullivan, 2005]. Recent GWASs [Athanasiu et al., 2010; Djurovic et al., 2010] have identified genetic risk variants for SCZ, but the findings explain only a small fraction of the estimated heritability, which is around 80% [Cardno and Gottesman, 2000]. One hypothesis is that gene × gene interactions may account for part of the missing heritability. In our application to the SCZ GWAS, we found several pairs of SNPs with potential gene × gene joint action. Our analysis results for SCZ illustrate the substantial advantages of our approach in terms of both computational intensity and statistical power.

Method


We consider case-control studies and assume a causal gene × gene joint action of two loci, each with two alleles. The alleles at the two loci are denoted by E and e, and by F and f, respectively, and we use inline image and inline image to indicate the frequencies of the phase-known genotype (EF|EF) in cases and controls, respectively. The superscripts a and u indicate cases and controls, respectively. We denote by pE and pF the allele frequencies in the population, and by inline image(inline image) and inline image(inline image) the allele and genotype frequencies, respectively, in cases (controls). We denote the joint frequency at two different gametes in cases and controls by inline image and inline image, respectively. inline image and inline image are the amounts of LD in cases and controls. Because LD is usually not observed, we consider the composite LD [Weir, 1996; Zaykin, 2004], and the composite LDs in cases and controls are denoted as inline image and inline image, respectively. We denote the amounts of single-locus level and haplotype-level HWD for the cases by inline image and inline image, respectively. The numbers of cases and controls are denoted by na and nu, respectively.

Analysis of gene × gene joint action typically tests the homogeneity of genotype frequencies in cases and controls. There are 9 and 10 different genotypes at two SNPs when the genotypes are unphased and phased, respectively, and the differences in the two-locus genotype frequencies between cases and controls suggest the presence of gene × gene joint action, a special case of which could be single gene action. Therefore, the presence of gene × gene joint action can be detected by comparing the two-locus genotype frequencies. In particular, when genotypes are phased, the null hypothesis for any joint action analysis is as follows:

  • display math

The corresponding null hypothesis for the phase-unknown genotypes can be simply constructed. This null hypothesis is equivalent to the following null hypothesis under a transformation [Won and Elston, 2008]:

  • display math

Here, nine components are required to test the equivalence of the phased genotype frequencies: the two minor allele frequencies, six components for the level of haplotype HWD, and one component for the level of LD. They can be categorized into three types of informative components for association analysis [Won and Elston, 2008]: the allele frequency, the amount of HWD, and the level of LD. Under this transformation, general gene × gene joint action can be detected by comparing the differences in these three types of components instead of the two-locus genotype frequencies.
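The genotype and component counts quoted above can be checked by enumeration. The following is a minimal sketch, not part of the original analysis; the variable names are illustrative, and the split of the nine components into 2 + 6 + 1 is taken directly from the text.

```python
from itertools import combinations_with_replacement, product

ALLELES_1 = ("E", "e")
ALLELES_2 = ("F", "f")

# Unphased data: one single-locus genotype per locus, (EE, Ee, ee) x (FF, Ff, ff).
single_locus_1 = list(combinations_with_replacement(ALLELES_1, 2))
single_locus_2 = list(combinations_with_replacement(ALLELES_2, 2))
unphased = list(product(single_locus_1, single_locus_2))

# Phased data: an unordered pair of haplotypes drawn from EF, Ef, eF, ef.
haplotypes = ["".join(h) for h in product(ALLELES_1, ALLELES_2)]
phased = list(combinations_with_replacement(haplotypes, 2))

# Genotype frequencies sum to one, so there is one fewer free parameter
# than there are phased genotypes.
free_parameters = len(phased) - 1

# Split of the components as stated in the text.
components = {"minor allele frequency": 2, "haplotype HWD": 6, "LD": 1}

print(len(unphased), len(phased), free_parameters, sum(components.values()))
# prints: 9 10 9 9
```

The ten phased genotype frequencies carry nine free parameters, which matches the nine transformed components.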

In practice, however, haplotypes are often not available, and estimating haplotype frequencies for the two loci adds computational complexity. Consequently, we consider the amount of single-locus level HWD instead of haplotype-level HWD, and composite LD [Wang et al., 2007; Won et al., 2009a; Zaykin, 2004] for the computation of inline image and inline image. Furthermore, because Hardy-Weinberg equilibrium (HWE) at each marker and linkage equilibrium (LE) between two markers are usually preserved in a population and cases are often less common than controls, HWE and LE tests computed in cases often constitute a more efficient approach than testing the different amounts of HWD and LD between cases and controls [Wang and Shete, 2008; Won and Elston, 2008]. The following testing strategy therefore seems justified. If the tests for HWE and LE in controls are not significant, we propose to use tests for HWE and LE in cases only; if they are significant, the amounts of HWD or LD in cases and controls have to be compared. Consequently, when HWE and LE are preserved in controls, we consider the following null hypothesis:

  • display math

If two markers are linked and the HWE at each locus is not preserved in controls, the null hypothesis is equivalent to the following:

  • display math

For statistical testing under this null hypothesis, we first calculated the statistics for each equality and then combined their P-values into a single statistic. When HWE and LE were preserved in controls, our individual statistics for each equality in the null hypothesis were as follows:

  • math image

Here the variances in the amount of HWD and composite LD under HWE and LE [Weir, 1996] can be calculated as follows:

  • math image
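Because the displayed formulas are not reproduced here, the following is only a minimal sketch of case-only HWE and LE tests in their common textbook form, assumed (not confirmed) to correspond to the statistics above. The function names, the 0/1/2 genotype coding, and the 1-degree-of-freedom chi-square reference distribution are assumptions made for illustration.

```python
import math

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

def hwd_test(n_EE, n_Ee, n_ee):
    """Single-locus HWE test from genotype counts (for example, in cases only).

    Returns the HWD coefficient, the test statistic, and its P-value.
    """
    n = n_EE + n_Ee + n_ee
    p = (2 * n_EE + n_Ee) / (2.0 * n)       # frequency of allele E
    if p <= 0.0 or p >= 1.0:
        raise ValueError("monomorphic locus")
    d = n_EE / n - p * p                    # HWD coefficient
    variance = (p * (1 - p)) ** 2 / n       # variance of the estimate under HWE
    statistic = d * d / variance
    return d, statistic, chi2_sf_1df(statistic)

def composite_ld_test(g_E, g_F):
    """Composite LD test from unphased genotypes coded as 0/1/2 allele counts.

    The variance used here assumes HWE at both loci and LE between them.
    """
    n = len(g_E)
    p_E = sum(g_E) / (2.0 * n)
    p_F = sum(g_F) / (2.0 * n)
    if min(p_E, p_F) <= 0.0 or max(p_E, p_F) >= 1.0:
        raise ValueError("monomorphic locus")
    mean_product = sum(a * b for a, b in zip(g_E, g_F)) / n
    # Composite LD equals half the covariance of the two allele counts.
    delta = (mean_product - 4.0 * p_E * p_F) / 2.0
    variance = p_E * (1 - p_E) * p_F * (1 - p_F) / n
    statistic = delta * delta / variance
    return delta, statistic, chi2_sf_1df(statistic)
```

For example, `hwd_test(30, 40, 30)` gives a statistic of about 4, the same value as the usual goodness-of-fit chi-square test for HWE on those counts.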

When HWE and LE are not guaranteed in controls, we should test the homogeneity of the level of the HWD and LD between cases and controls, rather than inline image, inline image, and inline image:

  • math image

After we calculated these statistics, the second step of our approach was to provide overall inference based on these individual statistics. Although the most powerful test can be constructed when some information is available about effect sizes [Won et al., 2009b], no uniformly most powerful method exists for combining P-values of different test statistics [Birnbaum, 1954]. We decided to use Fisher's method in this study because it has been shown to be the most efficient approach in terms of Bahadur's relative efficiency [Little and Folks, 1971]. Under HWE and LE in the population, the test statistics inline image, inline image, inline image, inline image, and inline image are mutually independent [Won and Elston, 2008], and Fisher's method can be used to combine them. Therefore, if we denote their P-values by inline image, inline image, inline image, inline image, and inline image, respectively, our overall statistic for testing the presence of a gene × gene joint action under HWE and LE in the population becomes

  • math image

which is denoted as FIS1. In the absence of HWE or LE, we cannot guarantee their independence, and in this case, or when the sample sizes are small, permutation tests have to be used to maintain the correct significance level of the overall test statistic. If we denote the P-values for inline image, inline image, and inline image by inline image, inline image, and inline image, respectively, we can calculate −2[log inline image + log inline image + log inline image + log inline image + log inline image] for both the original and the permuted samples; the value from the original sample is then compared with those from the permuted samples to calculate the empirical P-value.
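The combination step can be sketched as follows. This is a minimal illustration of Fisher's method, not the authors' implementation: for k independent P-values, −2 Σ log p follows a chi-square distribution with 2k degrees of freedom under the null hypothesis, and the upper tail has a closed form for even degrees of freedom. The function names and the add-one correction in the empirical P-value are assumptions.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable with an even number of df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i            # (x/2)^i / i!
        total += term
    return math.exp(-half) * total

def fisher_statistic(p_values):
    """Fisher's combined statistic: -2 times the sum of the log P-values."""
    return -2.0 * sum(math.log(p) for p in p_values)

def fisher_combine(p_values):
    """Return the combined statistic and its P-value, assuming independent tests."""
    statistic = fisher_statistic(p_values)
    return statistic, chi2_sf_even_df(statistic, 2 * len(p_values))

def empirical_p_value(observed, permuted):
    """Permutation P-value: share of permuted statistics at least as large as observed.

    The add-one correction is a common convention and an assumption here.
    """
    exceed = sum(1 for s in permuted if s >= observed)
    return (exceed + 1) / (len(permuted) + 1)
```

With five component P-values, as in FIS1, `fisher_combine` refers the statistic to a chi-square distribution with 10 degrees of freedom. When independence is in doubt, `fisher_statistic` can instead be evaluated on the original and on each permuted data set and passed to `empirical_p_value`.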

When detection of only a statistical gene × gene interaction that is orthogonal to the main effects is required, inference with the proposed overall statistic is not valid because a difference in minor allele frequencies by itself can generate significant results. Therefore, it is better to consider only inline image or inline image. Notably, var(inline image) and var(inline image) for inline image and inline image in the presence of marginal effects are affected by the disease mode of inheritance. For instance, even when HWE and LE are present in the population, the presence of dominant or recessive disease effects generates HWD in cases and controls, and then the variances of inline image and inline image in the absence of gene × gene interaction are respectively shown by Weir [Weir, 1996] as follows:

  • math image

Results


Simulation Studies for Joint Action Analysis

All possible two-locus models can be explained with 50 different models, and in our simulations we considered the 8 typical two-locus models for gene × gene joint action [Dong et al., 2008], M1–M8 (see Fig. 1): the joint recessive-recessive model, joint recessive-dominant model, single-gene recessive model, threshold model, modifying-effect model, joint dominant-dominant model, exclusive OR model, and the diagonal model. The diagonal model is the disease model where the diagonal elements in the contingency table of two-locus genotypes have higher risk than the off-diagonal elements (see Fig. 1). In Figure 1, λ1 and λ2 denote the disease genotype relative risk in the case of low and high risk, respectively, and λ1 is 1.

Figure 1. Eight typical two-locus models. λ1 denotes low-risk genotype combinations and λ2 denotes high-risk genotype combinations. M1, joint recessive-recessive model (RR); M2, joint recessive-dominant model (RD); M3, single-gene recessive model (1L: R); M4, threshold model (T); M5, modifying-effect model (Mod); M6, joint dominant-dominant model (DD); M7, exclusive OR model (XOR); M8, diagonal model (Diagonal).

We calculated the noncentrality parameters of the statistics for each equality in the alternative hypothesis for gene × gene joint action analysis. The disease prevalence and the disease allele frequency were assumed to be 0.2. We assumed HWE and LE between two SNPs in the population and, under these circumstances, it should be noted that the statistics for the three components are asymptotically mutually independent [Won and Elston, 2008]. The penetrance for each disease genotype was calculated under the assumed disease model. Figure 2 shows the noncentrality parameters for M1–M8. Our results show that the three types of components are usually informative for each joint action model except M8. Furthermore, testing differences in the HWD and LD between cases and controls is a less powerful approach than testing HWE and LE in cases alone.

Figure 2. Noncentrality parameters for inline image and inline image under M1, M2, …, M8. Noncentrality parameters were calculated for the three types of components under the eight gene × gene interaction models (see Fig. 1). The disease allele frequencies for both SNPs were assumed to be 0.2. The disease prevalence was assumed to be 0.2, and the penetrance for each disease genotype was calculated under the assumed disease prevalence.

Furthermore, the empirical estimates of type-1 error and power were calculated with the simulated data. The disease prevalence was assumed to be 0.2. We assumed HWE and LE in the population. The disease genotypes were generated from the multinomial distribution and their phenotypes were then sampled with the penetrances of their genotypes. Sampling was repeated until the given numbers of cases and controls were obtained. In our simulation, FIS1 combines inline image, inline image, inline image, inline image, and inline image with Fisher's method. FIS2 combines inline image, inline image, inline image, inline image, and inline image. Table 1 shows the empirical type-1 error estimates from 5,000 replicates at the 0.1 significance level. In Table 1, we assumed the disease genotype relative risks λ1 = λ2 = 1 for any combination of two SNPs. The disease allele frequencies for the two disease loci were randomly generated from the uniform distribution U(0.1, 0.9). Our results show that there is no evidence of inflation for the proposed methods for detecting joint action.
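The sampling scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the simulation code used in the study: genotypes at the two loci are drawn independently (HWE and LE), the penetrance table is obtained by scaling a relative-risk table to the target prevalence, and sampling continues until the requested numbers of cases and controls are reached. The function names and the `DD_MODEL` table (a joint dominant-dominant pattern with λ2 = 1.5) are hypothetical.

```python
import random

def hwe_frequencies(p):
    """Genotype frequencies under HWE for 0, 1, and 2 copies of the disease allele."""
    q = 1.0 - p
    return [q * q, 2.0 * p * q, p * p]

def penetrance_table(p_E, p_F, relative_risk, prevalence):
    """Scale a 3x3 relative-risk table so that the population prevalence is matched."""
    f_E, f_F = hwe_frequencies(p_E), hwe_frequencies(p_F)
    mean_rr = sum(f_E[i] * f_F[j] * relative_risk[i][j]
                  for i in range(3) for j in range(3))
    baseline = prevalence / mean_rr
    return [[baseline * relative_risk[i][j] for j in range(3)] for i in range(3)]

def simulate_case_control(p_E, p_F, relative_risk, prevalence,
                          n_cases, n_controls, rng):
    """Draw two-locus genotypes until the requested cases and controls are obtained."""
    pen = penetrance_table(p_E, p_F, relative_risk, prevalence)
    f_E, f_F = hwe_frequencies(p_E), hwe_frequencies(p_F)
    cases, controls = [], []
    while len(cases) < n_cases or len(controls) < n_controls:
        g_E = rng.choices(range(3), weights=f_E)[0]
        g_F = rng.choices(range(3), weights=f_F)[0]   # LE: loci drawn independently
        affected = rng.random() < pen[g_E][g_F]
        group, limit = (cases, n_cases) if affected else (controls, n_controls)
        if len(group) < limit:
            group.append((g_E, g_F))
    return cases, controls

# Hypothetical example: high risk when both loci carry at least one disease allele.
LAMBDA_2 = 1.5
DD_MODEL = [[LAMBDA_2 if i >= 1 and j >= 1 else 1.0 for j in range(3)]
            for i in range(3)]

cases, controls = simulate_case_control(0.2, 0.2, DD_MODEL, 0.2,
                                        1500, 1500, random.Random(2013))
```

Setting every entry of the relative-risk table to 1 reproduces the null setting λ1 = λ2 = 1 used for the type-1 error estimates.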

Table 1. Empirical type-1 error at the 0.1 significance level

N        inline image     inline image     FIS1             FIS2
1,000    0.103 ± 0.009    0.116 ± 0.009    0.098 ± 0.008    0.114 ± 0.009
2,000    0.099 ± 0.008    0.099 ± 0.008    0.108 ± 0.009    0.096 ± 0.008
3,000    0.096 ± 0.008    0.09 ± 0.008     0.099 ± 0.008    0.097 ± 0.008

  1. The empirical type-1 errors at the 0.1 significance level and 95% confidence intervals were calculated from 5,000 replicates. Cases (na = 1,500) and controls (nu = 1,500) were generated, and the disease allele frequencies for the two disease loci were randomly generated from the uniform distribution U(0.1, 0.9). FIS1 combines inline image, inline image, inline image, inline image, and inline image with Fisher's method and FIS2 combines inline image, inline image, inline image, inline image, and inline image.

The empirical power estimates at the 0.01 significance level are shown in Figure 3. In our simulations, we considered nine different joint action models, that is, M1–M9. For M1–M8, the joint action models are shown in Figure 1, and λ2 was assumed to be 1.5. For M9, the disease genotype relative risks for each two-locus genotype were randomly sampled from the uniform distribution U(1, 1.5). In particular, the relative risks detectable with 80% power by FIS1 for M1–M8, for the case illustrated in Figure 3, are 7.9, 2.4, 1.7, 1.9, 1.6, 1.4, 1.5, and 1.6, respectively, which indicates that M1, unlike the other disease models, has low power because only one genotype has a high relative risk. For disease allele frequencies, we assumed that pE = pF = 0.2, but we found that the results for other disease allele frequencies were similar. For comparison with traditional methods, the generalized linear model for a binary trait with a logit link was considered. LRT11 indicates the likelihood ratio test that compares the linear interaction model in logistic regression with the intercept model, and in LRT21 the linear interaction model in logistic regression is compared with the linear model without the interaction term. Therefore, under the null hypothesis, LRT11 follows the chi-square distribution with three degrees of freedom, and LRT21 follows the chi-square distribution with one degree of freedom. For LRT12 and LRT22, dummy variables are utilized for each two-locus genotype instead of additive genotype scores. The model with dummy variables for the nine two-locus genotypes in logistic regression is compared with the intercept model for the former, and with the linear model without interaction terms for the latter. Therefore, LRT12 and LRT22 indicate likelihood ratio tests with eight and four degrees of freedom, respectively.

Figure 3. Empirical power estimates at the 0.01 significance level for gene × gene joint action analysis. Empirical power for various values of λ2 was estimated with 5,000 replicates at the 0.01 significance level. The disease allele frequencies for two disease SNPs were assumed to be 0.3 and 0.1, respectively. The disease prevalence was assumed to be 0.2. FIS1 combines inline image, inline image, inline image, inline image, and inline image with Fisher's method, and FIS2 combines inline image, inline image, inline image, inline image, and inline image. LRT11 (LRT21) indicates the likelihood ratio test that compares the linear interaction model in logistic regression with the intercept model (the linear model). For LRT12 and LRT22, dummy variables are utilized, and both indicate the likelihood ratio tests that compare the linear interaction model in logistic regression with the intercept model and the linear model, respectively.

FIS1 generally seems to be the most powerful approach (Fig. 3), although the most powerful approach varies with the gene × gene joint action model. In particular, FIS1 usually performs better than FIS2, which indicates that testing HWE and LE in cases is more efficient than testing the differences in HWD and LD between cases and controls. Comparison of FIS1 with T3|a showed the power improvement of the overall statistic over the individual statistic, T3|a. The results for M9 indicate that the proposed method is more informative than the other methods when the two-locus genotype effects are randomly assigned. LRT12 (LRT22) is usually more efficient than LRT11 (LRT21), and the gene × gene joint action model using dummy variables for each two-locus genotype seems to be better than additive genotype scoring. These results suggest that for many gene × gene joint action models, the overall differences between cases and controls are captured by the proposed statistic and that the individual statistics for the marginal effects are also informative for detecting gene × gene joint action. The logistic regressions simply compare the joint disease genotype frequencies, but their performance can be affected by the genotype scoring.

Simulation Studies for Statistical Gene × Gene Interaction

The empirical type-1 error and power estimates for statistical gene × gene interaction analysis were also evaluated with the simulated data. The disease prevalence was set to 0.2, and HWE and LE in the population were assumed at the disease locus. We assumed that both pE and pF were 0.2. The disease genotypes were generated from a multinomial distribution, and their phenotypes were then sampled with the penetrance for each genotype. Sampling was repeated until 1,500 cases and 1,500 controls were obtained. We considered inline image, inline image, LRT21, and LRT22 for gene × gene interactions, and inline image and inline image in inline image and inline image were calculated under both the presence and absence of HWE. In addition, we considered the Wald test for the coefficient of the statistical gene × gene interaction term in the logistic regression, and it was denoted by inline image.

For statistical validity, we assumed the presence of marginal genetic effects in the absence of statistical gene × gene interactions; inference regarding the statistical gene × gene interaction term should not be affected by the presence of marginal genetic effects. Even when HWE is preserved in the population, HWE in cases is not preserved at a disease locus with a dominant or recessive marginal genetic effect [Won and Elston, 2008], and both disease models were considered in our simulation studies. We assumed that E was a disease allele, and the high disease-genotype risk, λ2, for the marginal genetic effect was assumed to be 1.5. We calculated the empirical type-1 error estimates from 5,000 replicates at the 0.1 significance level (Table 2). The results show that inline image and inline image always preserve the significance level if inline image and inline image are calculated under HWD, but that the empirical type-1 errors depart from the nominal level otherwise (inflated under the recessive model and deflated under the dominant model in Table 2). The empirical type-1 error estimates were slightly inflated for LRT21, LRT22, and WALD12 under the dominant marginal effect model, and these statistics appear to be affected by the presence of marginal effects. We also calculated the empirical power estimates (Fig. 4). M1–M9, excluding M3, were considered, because M3 has no gene × gene interaction effect. Our results show that LRT22 with four degrees of freedom is the most efficient, followed by inline image. However, LRT22 appears to be affected by the marginal genetic effect, because its empirical type-1 error estimates did not preserve the nominal significance level, and the computation of the likelihood ratio test for logistic regression is known to be intensive. Therefore inline image can be a reasonable alternative for gene × gene interaction analysis at the genome-wide scale.

Table 2. Empirical type-1 error at the 0.1 significance level

                                  inline image                        inline image
Disease mode of inheritance  λ2   HWE              HWD              HWE              HWD              LRT21            LRT22            WALD12
Recessive                    2    0.1222 ± 0.0092  0.0980 ± 0.0084  0.1178 ± 0.0091  0.1094 ± 0.0088  0.1026 ± 0.0086  0.1150 ± 0.0090  0.1018 ± 0.0086
                             3    0.1364 ± 0.0097  0.0972 ± 0.0084  0.1150 ± 0.0090  0.0988 ± 0.0084  0.0890 ± 0.0081  0.1008 ± 0.0085  0.0880 ± 0.0080
                             4    0.1482 ± 0.0100  0.1000 ± 0.0085  0.1244 ± 0.0093  0.1064 ± 0.0087  0.0766 ± 0.0075  0.1022 ± 0.0086  0.0758 ± 0.0075
Dominant                     2    0.0794 ± 0.0076  0.1052 ± 0.0087  0.0946 ± 0.0083  0.1066 ± 0.0087  0.1142 ± 0.0090  0.1146 ± 0.0090  0.1138 ± 0.0090
                             3    0.0654 ± 0.0070  0.1100 ± 0.0088  0.0806 ± 0.0077  0.1024 ± 0.0086  0.1328 ± 0.0096  0.1124 ± 0.0089  0.1320 ± 0.0096
                             4    0.0486 ± 0.0061  0.1064 ± 0.0087  0.0706 ± 0.0072  0.1044 ± 0.0086  0.1360 ± 0.0097  0.1156 ± 0.0090  0.1348 ± 0.0097

  1. The empirical type-1 error at the 0.1 significance level was calculated from 5,000 replicates. 1,500 cases and 1,500 controls were generated and the disease allele frequencies for the two disease loci were randomly generated from the uniform distribution U(0.1, 0.9). LRT21 indicates the likelihood ratio test that compares the linear interaction model in logistic regression with the linear model. For LRT22, dummy variables are utilized, and it indicates the likelihood ratio test that compares the linear interaction model in logistic regression with the linear model. WALD12 indicates the Wald test for the coefficient of statistical gene × gene interaction under additive genotype scoring.

Figure 4. Empirical power estimates at the 0.01 significance level for gene × gene statistical interaction analysis. Empirical powers for various λ2 were estimated with 5,000 replicates at the 0.01 significance level. The disease allele frequencies for two disease SNPs were assumed to be 0.2 and 0.2, respectively. The disease prevalence was assumed to be 0.2. M3 was excluded because it assumes that there is a single gene effect.

Data Analysis

We applied the proposed method to the German-Dutch GWAS on SCZ. The details of the study are provided in the Appendix. Because heterogeneity between the Dutch and German subjects was observed in the sample, the individual statistics for each equality in our null hypothesis were calculated separately for each group, i.e., for the Dutch samples and for the German samples. The tests of HWE in both cases and controls were used for quality-control purposes, and all SNPs with small P-values for the HWE test were excluded from the analysis. Therefore, even though the tests of HWE in cases are in general informative for detecting joint action, they were not informative in our situation, and we did not include them in the overall statistic. In addition, we focused on unlinked markers. A pair of markers was considered linked if inline image for the pair of SNPs was significant at the 0.05 significance level, and such pairs were excluded from the analysis. For unlinked markers, we used inline image to detect the gene × gene interaction, and it should be noted that inline image and inline image are independent. If we let the subscripts Dutch and German distinguish the P-values for the Dutch and German samples, respectively, our overall statistic, FIS1, is

  • display math

which follows the chi-square distribution with 12 degrees of freedom under H0. We also confirmed the presence of population admixture by estimating the amount of variance inflation for inline image. Our qq plots in Figure 5 show evidence for conservativeness of the statistics, and genomic control was applied to inline image. inline image has been shown to be robust against the presence of population stratification [Zhao et al., 2006].
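The genomic control adjustment mentioned above can be sketched as follows. This is a minimal illustration of the standard median-based form, assumed rather than confirmed to be the variant used in the study: the inflation factor is the median of the observed 1-degree-of-freedom chi-square statistics divided by the theoretical median, and each statistic is divided by that factor. The function names are illustrative, and no lower bound of 1 is imposed on the factor because the text reports applying the correction to statistics that looked conservative.

```python
import statistics

# Median of the chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364

def inflation_factor(chi2_stats):
    """Variance inflation factor estimated from the median test statistic."""
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN

def genomic_control(chi2_stats):
    """Return the inflation factor and the rescaled statistics."""
    lam = inflation_factor(chi2_stats)
    return lam, [s / lam for s in chi2_stats]
```

A factor below 1 indicates conservative statistics and rescales them upward; a factor above 1 indicates inflation and rescales them downward.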

Figure 5. qq plot of inline image. The original P-values from inline image for German samples are used for the qq plot (A) and genomic control is applied for plot (B). The original P-values from inline image for Dutch samples are used for qq plot (C) and genomic control is applied for plot (D).

The results of our analysis are shown in Table 3. In our analysis, the number of statistics being tested was around 3.56 × 10^10, and our genome-wide significance level was 1 × 10^−12. The gene × gene joint action analysis by logistic regression at the genome-wide scale is computationally very intensive, and therefore it was applied only to the top 15 pairs of SNPs selected by FIS1. These selected pairs of SNPs were analyzed by inline image, FIS1, LRT11, LRT21, LRT12, and LRT22. None of them was significant at the genome-wide level (Table 3). The most significant result for FIS1 was from the pair rs5765930 and rs10812774 [Liu et al., 2012]. rs5765930 is located in the region of the genes PRR5 and LOC553158 on chromosome 22q13.31, a region in which evidence for linkage to SCZ has been reported [Liu et al., 2012]. rs10812774 is located in the vicinity of LINGO2 on chromosome 9, and rs10812774 has been reported to be associated with other neurologic diseases such as essential tremor and Parkinson's disease [Wu et al., 2011]. Among the genes tagged by SNPs in the 15 most significant SNP pairs, EIEF2AP4, PAX6, STX1A, POU5F1P2, SLC15A1, TCF4, and CTBP1 have been reported to be related to SCZ [Chiang et al., 2011; Chu et al., 2009; Thwaites and Anderson, 2007; Tsuang et al., 2005; Zhu et al., 2013]. PTPRG, NCALD, CMTM8, and PGM have been reported to be related to neurologic diseases or psychopathy [Baldi et al., 1993; Csutora et al., 2006; LeBlanc et al., 2012; Weber et al., 2011]. In particular, inline image is 7.13 × 10^−12 for rs11779840 and rs10110783, and a statistical gene × gene interaction is suspected for this pair. The LRT11 P-value for rs4717104 and rs9870965 is 8.83 × 10^−11, but this result may be invalid because of population admixture, and some modification is necessary. These pairs of SNPs will be further investigated in our follow-up studies.
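The order of magnitude of the genome-wide threshold can be checked with a short calculation. This is a minimal sketch that assumes a Bonferroni correction at a family-wise level of 0.05; the text does not name the correction, so both the method and the level are assumptions, and `pair_count` is an illustrative helper.

```python
def pair_count(n_snps):
    """Number of unordered SNP pairs in an exhaustive two-way search."""
    return n_snps * (n_snps - 1) // 2

def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance level under a Bonferroni correction (assumed here)."""
    return alpha / n_tests

n_tests = 3.56e10                      # number of tests reported in the text
threshold = bonferroni_threshold(n_tests)
print(f"{threshold:.2e}")              # prints: 1.40e-12
```

The result is of the same order as the 1 × 10^−12 level quoted above, which is slightly more stringent.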

Table 3. Application to SCZ

SNP1        Chr  Pos (kb)  Gene        SNP2        Chr  Pos (kb)  Gene      (inline image)  FIS1         LRT11        LRT21        LRT12        LRT22
rs5765930   22   43,494    PPR         rs10812774  9    28,284    LINGO2    1.70×10^−6      1.25×10^−10  4.32×10^−5   1.69×10^−3   7.60×10^−2   3.61×10^−4
rs1776161   1    240,136   EXO1        rs4643      1    63,898    PGM1      6.12×10^−6      2.33×10^−10  1.08×10^−4   1.77×10^−5   6.24×10^−5   6.86×10^−4
rs2487504   10   44,575    EIEF2AP4    rs1381462   3    160,002   MFSD1     7.41×10^−7      2.46×10^−10  1.15×10^−2   5.83×10^−1   9.89×10^−1   1.96×10^−1
rs2160906   22   35,823    TMPRSS6     rs9951150   18   50,972    TCF4      2.76×10^−5      2.77×10^−10  5.84×10^−6   9.05×10^−2   4.92×10^−1   5.39×10^−4
rs1476644   7    3,894     SDK1        rs3755930   4    1,227     CTBP1     1.44×10^−8      2.85×10^−10  3.06×10^−2   3.07×10^−1   7.79×10^−1   1.95×10^−1
rs228904    22   35,825    TMPRSS6     rs9951150   18   50,972    TCF4      2.76×10^−5      3.03×10^−10  4.74×10^−6   1.00×10^−1   4.94×10^−1   4.33×10^−4
rs1614778   11   31,911    PAX6        rs932012    4    140,916   MAML3     5.01×10^−4      3.43×10^−10  6.61×10^−3   1.38×10^−2   1.84×10^−1   1.07×10^−1
rs9354442   6    67,323    Intergenic  rs450092    3    21,433    VENTXP7   2.26×10^−7      3.91×10^−10  7.56×10^−8   5.49×10^−5   2.61×10^−3   1.68×10^−5
rs2160906   22   35,823    TMPRSS6     rs2051293   18   50,969    TCF4      2.50×10^−6      4.13×10^−10  1.98×10^−4   1.99×10^−1   8.02×10^−1   1.13×10^−2
rs4717104   7    72,784    STX1A       rs9870965   3    62,048    PTPRG     1.94×10^−4      4.26×10^−10  8.83×10^−11  7.93×10^−4   1.68×10^−2   1.14×10^−8
rs228904    22   35,825    TMPRSS6     rs2051293   18   50,969    TCF4      2.50×10^−6      4.52×10^−10  1.63×10^−4   2.31×10^−1   8.34×10^−1   9.78×10^−3
rs10099863  8    98,944    RPS23P1     rs6790564   3    32,117    CMTM8     3.46×10^−7      4.78×10^−10  3.87×10^−4   5.95×10^−5   3.34×10^−3   1.85×10^−2
rs11779840  8    103,606   POU5F1      rs10110783  8    103,147   NCALD     7.13×10^−12     4.88×10^−10  9.56×10^−6   1.62×10^−6   1.64×10^−4   9.86×10^−4
rs4717104   7    72,784    STX1A       rs2526421   3    62,046    PTPRO     2.01×10^−4      5.28×10^−10  1.17×10^−10  7.94×10^−4   1.66×10^−2   1.45×10^−9
rs9513475   13   98,189    SLC15A1     rs3751379   13   38,486    PROSER1   1.10×10^−7      5.67×10^−10  6.78×10^−2   8.59×10^−1   3.66×10^−1   1.65×10^−1

Note: The proposed method was applied to SCZ, and the top 15 pairs of SNPs are shown. FIS1 combines inline image, inline image, inline image, inline image, and inline image with Fisher's method, and FIS2 combines inline image, inline image, inline image, inline image, and inline image. LRT11 (LRT21) indicates the likelihood ratio test that compares the linear interaction model in logistic regression with the intercept model (the linear model). For LRT12 and LRT22, dummy variables were utilized, and both indicate the likelihood ratio tests that compare the linear interaction model in logistic regression with the intercept model and the linear model, respectively. The position for each SNP is provided by NCBI reference build 36.3.
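The table note describes FIS1 as a combination of five component statistics by Fisher's method. The sketch below shows the generic form of that combination for independent P-values. The input P-values are hypothetical, and the function name `fisher_combine` is ours; it is not taken from the authors' software.

```python
import math


def fisher_combine(p_values):
    """Fisher's method for independent P-values.

    X = -2 * sum(ln p_i) follows a chi-square distribution with 2k degrees
    of freedom under the null hypothesis, where k = len(p_values).
    Returns (X, combined P-value). All p_i must lie in (0, 1].
    """
    k = len(p_values)
    x = -2.0 * sum(math.log(p) for p in p_values)
    # For even degrees of freedom 2k the chi-square survival function is
    # P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return x, math.exp(-half) * total


# Hypothetical component P-values for one SNP pair (illustration only).
example_p = [1e-3, 0.02, 0.4, 0.6, 0.01]
stat, combined = fisher_combine(example_p)
print(f"Fisher statistic = {stat:.3f}, combined P = {combined:.3e}")
```

The closed-form tail probability avoids any dependency beyond the standard library; the combination is valid only when the component tests are independent under the null hypothesis.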

Discussion

In this study, we proposed a new statistical method to detect gene × gene joint action based on three types of components. There is no uniformly most powerful method for the analysis of gene × gene joint action; however, our results show that the proposed method performs better than logistic regression under the considered gene × gene joint action model. The proposed method also performed well when the relative risk for each genotype was randomly generated, which indicates that it may be an efficient choice for detecting gene × gene joint action when no prior knowledge is available about the interaction model. Note that we assumed no haplotype effects in our simulations. If there are haplotype effects, the performance of the proposed method may improve because our LD test focuses on the co-occurrence of two alleles in the same haplotype.
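To make the three types of components concrete, the following sketch computes, from genotypes coded as allele counts (0, 1, 2), the allele frequency, the Hardy-Weinberg disequilibrium coefficient, and the composite LD coefficient, and contrasts them between cases and controls. It is a minimal illustration under our own assumptions: the toy genotypes are hypothetical, the function name `components` is ours, and the raw differences shown here are not the test statistics of the proposed method, which also require the variance estimates given in the Method section.

```python
from statistics import mean


def components(g1, g2):
    """Raw genetic components for two loci.

    g1, g2: per-individual counts (0, 1, 2) of the tracked allele at each locus.
    """
    n = len(g1)
    p1, p2 = mean(g1) / 2, mean(g2) / 2                 # allele frequencies
    hwd1 = sum(g == 2 for g in g1) / n - p1 ** 2        # D_A = P(AA) - p_A^2
    hwd2 = sum(g == 2 for g in g2) / n - p2 ** 2
    # Composite LD: half the covariance between the two allele counts.
    cld = sum(a * b for a, b in zip(g1, g2)) / (2 * n) - 2 * p1 * p2
    return {"freq": (p1, p2), "hwd": (hwd1, hwd2), "cld": cld}


# Hypothetical toy genotypes (illustration only).
cases_g1 = [0, 1, 2, 1, 2, 0, 1, 2]
cases_g2 = [0, 1, 2, 1, 2, 0, 1, 1]
ctrl_g1 = [0, 0, 2, 2, 1, 1, 0, 1]
ctrl_g2 = [0, 2, 0, 2, 1, 1, 1, 0]

case_c = components(cases_g1, cases_g2)
ctrl_c = components(ctrl_g1, ctrl_g2)
print("frequency difference, locus 1:", case_c["freq"][0] - ctrl_c["freq"][0])
print("HWD difference, locus 1:", case_c["hwd"][0] - ctrl_c["hwd"][0])
print("composite LD difference:", case_c["cld"] - ctrl_c["cld"])
```

The composite LD coefficient needs only unphased genotypes, which is why it can be computed directly from allele counts without haplotype estimation.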

The proposed method is computationally fast. For instance, our analysis of the SCZ study, which involved genome-wide examination of all pairwise joint actions, was completed in a week on a cluster with six nodes. For 1,000 cases and 1,000 controls with 1,000 pairs of SNPs, the proposed method, run in the statistical software R on a Xeon E5649 @2.66 GHz, was completed in 1.7 sec, whereas logistic regression in R required more than 34.9 sec. For 5,000 and 10,000 pairs of SNPs, the proposed method took 8.5 sec and 16.2 sec, respectively, and logistic regression took 176.5 sec and 333.5 sec. Because the logistic regression in R calls embedded C code, the difference may be even larger when both methods are implemented under the same conditions. This shows that our approach can be used for genome-wide analysis of gene × gene joint action. To date, many methods for detecting joint action at a genome-wide level have been investigated. However, most of them are limited by their computational burden, which often forces their analysis to be restricted to “select SNPs.” Although the impact of a screening step for gene × gene interaction has not been carefully investigated, we expect that two-stage analysis approaches, for instance, for MDR, can lead to a substantial drop in efficiency. The computational simplicity of our approach is a key advantage in the GWAS context, and the software implemented in C for the proposed method can be freely downloaded at http://cau.ac.kr/~swon.
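The reported timings can be turned into a rough per-pair cost and extrapolated to the genome-wide scale. The sketch below does this under two assumptions of ours: run time scales linearly with the number of SNP pairs, and the roughly 3.56 × 10^10 statistics reported in the Results correspond to one evaluation each. The estimates refer to the single-machine R timings, not to the C implementation run on the cluster.

```python
# Timings reported in the text (seconds) for 1,000 / 5,000 / 10,000 SNP pairs in R.
pairs = [1_000, 5_000, 10_000]
proposed_sec = [1.7, 8.5, 16.2]
logistic_sec = [34.9, 176.5, 333.5]


def sec_per_pair(n_pairs, seconds):
    """Least-squares slope through the origin: estimated seconds per SNP pair."""
    num = sum(n * s for n, s in zip(n_pairs, seconds))
    den = sum(n * n for n in n_pairs)
    return num / den


n_tests = 3.56e10  # number of statistics tested in the SCZ analysis (from the Results)

rate_proposed = sec_per_pair(pairs, proposed_sec)
rate_logistic = sec_per_pair(pairs, logistic_sec)
speedup = rate_logistic / rate_proposed

for name, rate in [("proposed", rate_proposed), ("logistic", rate_logistic)]:
    days = rate * n_tests / 86_400  # assumed linear extrapolation, single core
    print(f"{name}: {rate * 1e3:.2f} ms per pair, about {days:,.0f} single-core days")
print(f"estimated speed-up: {speedup:.1f}x")
```

The fit through the origin is a deliberate simplification: it ignores fixed start-up costs, which appear negligible given how closely the three reported timings follow a straight line.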

Nevertheless, our simulation studies have some limitations. First, we compared the power of the proposed method only with that of logistic regression, and the power comparison results may not readily generalize. Recently, many statistical methods such as MDR [Ritchie et al., 2001] and Bayesian epistasis association mapping [Zhang and Liu, 2007] have been proposed. Even though they were not included in the simulation studies, power comparisons of these methods with logistic regression have repeatedly shown that logistic regression is not efficient [Chen et al., 2011; Wu et al., 2010; Zhang and Liu, 2007]. However, even though logistic regression may not be the most efficient choice for the analysis of statistical gene × gene joint action, it should be noted that its statistical efficiency depends on the parameterization of the gene × gene interaction model. It was empirically shown that logistic regression performed most efficiently when genotypes were coded to reveal the true disease model [Ueki and Cordell, 2012], and logistic regression with nine dummy variables has been shown to be reasonably good [Chen et al., 2011]. Our results also show that logistic regression with nine dummy variables for each genotype performed better than logistic regression with a linear gene × gene interaction model. Therefore, even though our comparison is limited to logistic regression, the various parameterizations that we selected make our results reasonable, and an exhaustive power comparison with various approaches will be conducted in future work. Second, the proposed method has been applied only to the biallelic two-way gene × gene joint action model, even though, depending on the disease model, a three-way (or higher-order) joint action model, for instance, can be informative. The (composite) LD can be defined for multiple loci with multiple alleles [Yu and Wang, 2011; Zaykin et al., 2006, 2008], and our method can be extended to such a situation. For instance, for a three-way joint action model, we can have three tests for the difference of minor allele frequencies, three tests for the presence of HWE, three tests for two-locus LE in cases, and one test for the presence of three-locus LE. However, the efficient way to construct the statistic for joint action analysis with the three types of components is not clear, and it will be investigated in future work. Third, if LE and HWE are not preserved in controls, the independence of the individual statistics is not guaranteed, and permutation was suggested as an alternative. However, for genome-wide interaction analysis, the significance level is usually very small, and thus permutation-based inference is limited at the genome-wide scale. These limitations will be investigated as part of our ongoing research.
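The limitation of permutation-based inference at very small significance levels can be quantified with a short calculation. The sketch below assumes the common empirical P-value estimator (b + 1) / (B + 1), where B is the number of permutations and b the number of permuted statistics at least as extreme as the observed one; this estimator and the function names are our assumptions, not details given in the text.

```python
import math


def min_permutation_p(n_perm: int) -> float:
    """Smallest P-value attainable with n_perm permutations (the case b = 0)."""
    return 1.0 / (n_perm + 1)


def permutations_needed(alpha: float) -> int:
    """Smallest number of permutations whose minimum attainable P-value is <= alpha."""
    return math.ceil(1.0 / alpha) - 1


alpha_gw = 1e-12  # genome-wide significance level used in the SCZ analysis
needed = permutations_needed(alpha_gw)
print(f"permutations needed per SNP pair: about {needed:.2e}")
print(f"smallest attainable P-value with 10^6 permutations: {min_permutation_p(10**6):.2e}")
```

Even before multiplying by the number of SNP pairs, on the order of 10^12 permutations would be required for a single pair to reach this level, which is why permutation is impractical here.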

Although GWASs have successfully identified many genetic variants for diseases, missing heritability reveals that previous findings do not fully explain the genetic background of many diseases, and this may be partially attributable to the computational burden of analyzing gene × gene joint action. Our method reduces the computational burden of analyzing gene × gene joint action at the genome-wide scale while preserving the statistically optimal efficiency, and therefore it may bridge the gap between statistical and computational issues for genome-wide studies.

Acknowledgments

This work was supported by the next-generation BioGreen 21 Program (no. PJ00901903), Rural Development Administration, Korea, and the National Research Foundation of Korea Grant funded by the Korean Government (NRF-2011-220-C00004). The authors would like to thank Dr. Robert C. Elston and the anonymous referees for their helpful suggestions and comments.

Thanks to all group investigators (Genetic Risk and Outcome in Psychosis): René S Kahn (1), Don H Linszen (2), Jim van Os (3), Durk Wiersma (4), Richard Bruggeman (4), Wiepke Cahn (1), Lieuwe de Haan (2), Lydia Krabbendam (3), and Inez Myin-Germeys (3). (1) Department of Psychiatry, Rudolf Magnus Institute of Neuroscience, University Medical Center Utrecht, Utrecht, The Netherlands (2) Academic Medical Centre University of Amsterdam, Department of Psychiatry, Amsterdam, The Netherlands (3) Maastricht University Medical Centre, South Limburg Mental Health Research and Teaching Network, Maastricht, The Netherlands (4) University Medical Center Groningen, Department of Psychiatry, University of Groningen, Groningen, The Netherlands.

References

  • Athanasiu L, Mattingsdal M, Kahler AK, Brown A, Gustafsson O, Agartz I, Giegling I, Muglia P, Cichon S, Rietschel M and others. 2010. Gene variants associated with schizophrenia in a Norwegian genome-wide study are replicated in a large European cohort. J Psychiatr Res 44(12):748–753.
  • Baldi E, Burra P, Plebani M, Salvagnini M. 1993. Serum malondialdehyde and mitochondrial aspartate aminotransferase activity as markers of chronic alcohol intake and alcoholic liver disease. Ital J Gastroenterol 25(8):429–432.
  • Barrett JC, Hansoul S, Nicolae DL, Cho JH, Duerr RH, Rioux JD, Brant SR, Silverberg MS, Taylor KD, Barmada MM and others. 2008. Genome-wide association defines more than 30 distinct susceptibility loci for Crohn's disease. Nat Genet 40(8):955–962.
  • Birnbaum A. 1954. Combining independent tests of significance. J Am Stat Assoc 49(267):559–574.
  • Cardno AG, Gottesman II. 2000. Twin studies of schizophrenia: from bow-and-arrow concordances to star wars Mx and functional genomics. Am J Med Genet 97(1):12–17.
  • Chen L, Yu G, Langefeld CD, Miller DJ, Guy RT, Raghuram J, Yuan X, Herrington DM, Wang Y. 2011. Comparative analysis of methods for detecting interacting loci. BMC Genomics 12:344.
  • Chiang CH, Su Y, Wen Z, Yoritomo N, Ross CA, Margolis RL, Song H, Ming GL. 2011. Integration-free induced pluripotent stem cells derived from schizophrenia patients with a DISC1 mutation. Mol Psychiatry 16(4):358–360.
  • Chikkagoudar S, Wang K, Li M. 2011. GENIE: a software package for gene-gene interaction analysis in genetic association studies using multiple GPU or CPU cores. BMC Res Notes 4:158.
  • Chu TT, Liu Y, Kemether E. 2009. Thalamic transcriptome screening in three psychiatric states. J Hum Genet 54(11):665–675.
  • Church C, Moir L, McMurray F, Girard C, Banks GT, Teboul L, Wells S, Bruning JC, Nolan PM, Ashcroft FM and others. 2010. Overexpression of Fto leads to increased food intake and results in obesity. Nat Genet 42(12):1086–1092.
  • Cook NR, Zee RY, Ridker PM. 2004. Tree and spline based association analysis of gene-gene interaction models for ischemic stroke. Stat Med 23(9):1439–1453.
  • Cordell HJ. 2002. Epistasis: what it means, what it doesn't mean, and statistical methods to detect it in humans. Hum Mol Genet 11(20):2463–2468.
  • Cordell HJ. 2009. Detecting gene-gene interactions that underlie human diseases. Nat Rev Genet 10(6):392–404.
  • Csutora P, Karsai A, Nagy T, Vas B, L Kovács G, Rideg O, Bogner P, Miseta A. 2006. Lithium induces phosphoglucomutase activity in various tissues of rats and in bipolar patients. Int J Neuropsychopharmacol 9(5):613–619.
  • Culverhouse R, Suarez BK, Lin J, Reich T. 2002. A perspective on epistasis: limits of models displaying no main effect. Am J Hum Genet 70(2):461–471.
  • Djurovic S, Gustafsson O, Mattingsdal M, Athanasiu L, Bjella T, Tesli M, Agartz I, Lorentzen S, Melle I, Morken G and others. 2010. A genome-wide association study of bipolar disorder in Norwegian individuals, followed by replication in Icelandic sample. J Affect Disord 126(1–2):312–316.
  • Dong C, Chu X, Wang Y, Jin L, Shi T, Huang W, Li Y. 2008. Exploration of gene-gene interaction effects using entropy-based methods. Eur J Hum Genet 16(2):229–235.
  • First MB, Spitzer RL, Gibbon M, Williams JBW. 1994. Structured Clinical Interview for DSM-IV Axis I Disorders. New York: New York State Psychiatric Institute.
  • Gyenesei A, Moody J, Laiho A, Semple CA, Haley CS, Wei WH. 2012. BiForce toolbox: powerful high-throughput computational analysis of gene-gene interactions in genome-wide association studies. Nucleic Acids Res 40(1):W628–W632.
  • Harold D, Abraham R, Hollingworth P, Sims R, Gerrish A, Hamshere ML, Pahwa JS, Moskvina V, Dowzell K, Williams A and others. 2009. Genome-wide association study identifies variants at CLU and PICALM associated with Alzheimer's disease. Nat Genet 41(10):1088–1093.
  • Jiang R, Tang W, Wu X, Fu W. 2009. A random forest approach to the detection of epistatic interactions in case-control studies. BMC Bioinformatics 10(Suppl 1):S65.
  • Klein RJ, Zeiss C, Chew EY, Tsai JY, Sackler RS, Haynes C, Henning AK, SanGiovanni JP, Mane SM, Mayne ST and others. 2005. Complement factor H polymorphism in age-related macular degeneration. Science 308(5720):385–389.
  • Krawczak M, Nikolaus S, von Eberstein H, Croucher PJ, El Mokhtari NE, Schreiber S. 2006. PopGen: population-based recruitment of patients and controls for the analysis of complex genotype-phenotype relationships. Community Genet 9(1):55–61.
  • LeBlanc M, Kulle B, Sundet K, Agartz I, Melle I, Djurovic S, Frigessi A, Andreassen OA. 2012. Genome-wide study identifies PTPRO and WDR72 and FOXQ1-SUMO1P1 interaction associated with neurocognitive function. J Psychiatr Res 46(2):271–278.
  • Li W, Reich J. 2000. A complete enumeration and classification of two-locus disease models. Hum Hered 50(6):334–349.
  • Little RC, Folks L. 1971. Asymptotic optimality of Fisher's method of combining independent tests. J Am Stat Assoc 66(336):986–998.
  • Liu J, Ulloa A, Perrone-Bizzozero N, Yeo R, Chen J, Calhoun VD. 2012. A pilot study on collective effects of 22q13.31 deletions on gray matter concentration in schizophrenia. PLoS One 7(12):e52865.
  • Marchini J, Donnelly P, Cardon LR. 2005. Genome-wide strategies for detecting multiple loci that influence complex diseases. Nat Genet 37(4):413–417.
  • McGuffin P, Farmer A, Harvey I. 1991. A polydiagnostic application of operational criteria in studies of psychotic illness. Development and reliability of the OPCRIT system. Arch Gen Psychiatry 48(8):764–770.
  • Nielsen DM, Ehm MG, Weir BS. 1998. Detecting marker-disease association by testing for Hardy-Weinberg disequilibrium at a marker locus. Am J Hum Genet 63(5):1531–1540.
  • Nielsen DM, Ehm MG, Zaykin DV, Weir BS. 2004. Effect of two- and three-locus linkage disequilibrium on the power to detect marker/phenotype associations. Genetics 168(2):1029–1040.
  • Park MY, Hastie T. 2008. Penalized logistic regression for detecting gene interactions. Biostatistics 9(1):30–50.
  • Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ and others. 2007. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81(3):559–575.
  • Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, Parl FF, Moore JH. 2001. Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am J Hum Genet 69(1):138–147.
  • Sasieni PD. 1997. From genotypes to genes: doubling the sample size. Biometrics 53(4):1253–1261.
  • Schmermund A, Mohlenkamp S, Stang A, Gronemeyer D, Seibel R, Hirche H, Mann K, Siffert W, Lauterbach K, Siegrist J and others. 2002. Assessment of clinically silent atherosclerotic disease and established and novel risk factors for predicting myocardial infarction and cardiac death in healthy middle-aged subjects: rationale and design of the Heinz Nixdorf RECALL Study. Risk Factors, Evaluation of Coronary Calcium and Lifestyle. Am Heart J 144(2):212–218.
  • Spitzer RL, Endicott J, Robins E. 1978. Research diagnostic criteria: rationale and reliability. Arch Gen Psychiatry 35(6):773–782.
  • Sullivan PF. 2005. The genetics of schizophrenia. PLoS Med 2(7):e212.
  • Thwaites DT, Anderson CM. 2007. H+-coupled nutrient, micronutrient and drug transporters in the mammalian small intestine. Exp Physiol 92(4):603–619.
  • Tsuang MT, Nossova N, Yager T, Tsuang MM, Guo SC, Shyu KG, Glatt SJ, Liew CC. 2005. Assessing the validity of blood-based gene expression profiles for the classification of schizophrenia and bipolar disorder: a preliminary report. Am J Med Genet B Neuropsychiatr Genet 133B(1):1–5.
  • Ueki M, Cordell HJ. 2012. Improved statistics for genome-wide interaction analysis. PLoS Genet 8(4):e1002625.
  • Wang J, Shete S. 2008. A test for genetic association that incorporates information about deviation from Hardy-Weinberg proportions in cases. Am J Hum Genet 83(1):53–63.
  • Wang T, Zhu X, Elston RC. 2007. Improving power in contrasting linkage-disequilibrium patterns between cases and controls. Am J Hum Genet 80(5):911–920.
  • Wang X, Elston RC, Zhu X. 2010. The meaning of interaction. Hum Hered 70(4):269–277.
  • Wang X, Elston RC, Zhu X. 2011. Statistical interaction in human genetics: how should we model it if we are looking for biological interaction? Nat Rev Genet 12(1):74.
  • Weber H, Kittel-Schneider S, Gessner A, Domschke K, Neuner M, Jacob CP, Buttenschon HN, Boreatti-Hummer A, Volkert J, Herterich S and others. 2011. Cross-disorder analysis of bipolar risk genes: further evidence of DGKH as a risk gene for bipolar disorder, but also unipolar depression and adult ADHD. Neuropsychopharmacology 36(10):2076–2085.
  • Weir BS. 1996. Genetic Data Analysis II: Methods for Discrete Population Genetic Data. Sunderland, MA: Sinauer Associates.
  • Wichmann HE, Gieger C, Illig T. 2005. KORA-gen–resource for population genetics, controls and a broad spectrum of disease phenotypes. Gesundheitswesen 67(Suppl 1):S26–S30.
  • Won S, Elston RC. 2008. The power of independent types of genetic information to detect association in a case-control study design. Genet Epidemiol 32(8):731–756.
  • Won S, Kim S, Elston RC. 2009a. Phase uncertainty in case-control association studies. Genet Epidemiol 33(6):463–478.
  • Won S, Morris N, Lu Q, Elston RC. 2009b. Choosing an optimal method to combine P-values. Stat Med 28(11):1537–1553.
  • Wu X, Dong H, Luo L, Zhu Y, Peng G, Reveille JD, Xiong M. 2010. A novel statistic for genome-wide interaction analysis. PLoS Genet 6(9):e1001131.
  • Wu YW, Prakash KM, Rong TY, Li HH, Xiao Q, Tan LC, Au WL, Ding JQ, Chen SD, Tan EK. 2011. Lingo2 variants associated with essential tremor and Parkinson's disease. Hum Genet 129(6):611–615.
  • Yu Z, Wang S. 2011. Contrasting linkage disequilibrium as a multilocus family-based association test. Genet Epidemiol 35(6):487–498.
  • Yung LS, Yang C, Wan X, Yu W. 2011. GBOOST: a GPU-based tool for detecting gene-gene interactions in genome-wide case control studies. Bioinformatics 27(9):1309–1310.
  • Zaykin DV. 2004. Bounds and normalization of the composite linkage disequilibrium coefficient. Genet Epidemiol 27(3):252–257.
  • Zaykin DV, Meng Z, Ehm MG. 2006. Contrasting linkage-disequilibrium patterns between cases and controls as a novel association-mapping method. Am J Hum Genet 78(5):737–746.
  • Zaykin DV, Pudovkin A, Weir BS. 2008. Correlation-based inference for linkage disequilibrium with multiple alleles. Genetics 180(1):533–545.
  • Zerba KE, Ferrell RE, Sing CF. 2000. Complex adaptive systems and human health: the influence of common genotypes of the apolipoprotein E (ApoE) gene polymorphism and age on the relational order within a field of lipid metabolism traits. Hum Genet 107(5):466–475.
  • Zhang Y, Liu JS. 2007. Bayesian inference of epistatic interactions in case-control studies. Nat Genet 39(9):1167–1173.
  • Zhao J, Jin L, Xiong M. 2006. Test for interaction between two unlinked loci. Am J Hum Genet 79(5):831–845.
  • Zhu X, Gu H, Liu Z, Xu Z, Chen X, Sun X, Zhai J, Zhang Q, Chen M, Wang K and others. 2013. Associations between TCF4 gene polymorphism and cognitive functions in schizophrenia patients and healthy controls. Neuropsychopharmacology 38(4):683–689.

Appendix

Data Description: Schizophrenia

Samples were obtained from either German or Dutch patients who provided written informed consent. German SCZ patients (n_a = 487), who were all of German descent, were recruited from consecutive hospital admissions. Lifetime best estimate diagnoses were assigned according to DSM-IV criteria based on structured interviews [First et al., 1994; McGuffin et al., 1991; Spitzer et al., 1978], medical records, and family history. Best estimate diagnoses were assigned by at least two experienced psychiatrists/psychologists. The controls were drawn from three population-based epidemiological studies: (i) PopGen [Krawczak et al., 2006] (n_u = 490), (ii) KORA [Wichmann et al., 2005] (n_u = 488), and (iii) the Heinz Nixdorf Recall (HNR, n_u = 383) study [Schmermund et al., 2002]. Inpatients and outpatients (n_a = 804) in the Netherlands were recruited from various psychiatric hospitals and institutions throughout Utrecht and Rotterdam. All the patients were diagnosed with the DSM-IV definition of SCZ. The controls from Utrecht (n_u = 704) were volunteers with no history of psychiatric disorders, and the Rotterdam control individuals (n_u = 2,302) were drawn from a large population-based project on the genetics of complex traits and diseases.

The GWAS data were genotyped at seven genotyping facilities using HumanHap550v3 BeadArrays with the Infinium II assay (Illumina, San Diego, CA, USA). We developed a protocol of filters for the stringent quality control of whole-genome and sub-whole-genome datasets that accounted for issues such as call rates, heterozygosity, cross-contamination, population stratification, relatedness, nonrandom missingness, HWE, and minor allele frequency. The quality-control protocol was applied to a total of 1,291 patients and 4,367 controls for the GWAS step. Consequently, 375 individuals (122 patients and 253 controls) and 86,037 SNPs were excluded prior to the association analysis. Therefore, the final GWAS dataset for gene × gene interaction analysis comprised 464,030 autosomal SNPs (including PAR1+2), genotyped in 1,169 SCZ patients and 3,714 controls.