Keywords:

  • Case-control studies;
  • U-statistic;
  • kernel function;
  • haplotype-based association;
  • internalising disorder

Summary

In case-control studies, association analysis was designed to test whether genetic variants were associated with human diseases. Analysing one genetic marker at a time suffered from weak power, because of the correction for multiple testing and the possibly small genetic effects. An alternative strategy was to test the effects of multiple markers simultaneously, which was believed to be more powerful. However, when the number of markers under investigation was large, such tests also suffered from weak power, because of the greater degrees of freedom. To overcome these limitations in case-control studies, we proposed a novel method that could test the joint association of several loci (i.e. haplotypes) with only a single degree of freedom. We developed a nonparametric approach based on U-statistics. We also introduced a new kernel for the U-statistic, which could incorporate haplotype structure information and was expected to enhance the power. Simulations indicated that the proposed approach had higher power than the allele-match kernel in identifying associations between diseases and haplotypes. Application of our method to a study of candidate genes for internalising disorder illustrated its utility and interpretability, and identified associations for two of the three candidate genes.


Introduction

Association analysis, with the aim of investigating genetic variations, is often used to test for genetic associations with human diseases in case-control studies (Bertranpetit & Calafell, 1996; Fallin et al., 2001; Schaid et al., 2002; Epstein & Satten, 2003; Clark, 2004; Lin, 2004; Zeng & Lin, 2005; Epstein & Kwee, 2009). The power of these methods depends on the frequency of disease allele(s), the allele frequencies of genetic markers and the magnitude of linkage disequilibrium (LD) between the markers and disease loci (Zondervan & Cardon, 2004; Schaid et al., 2005).

To evaluate associations between genetic markers and diseases, traditional methods analyse one genetic marker at a time. For example, we can first use Armitage's trend test to compare the allele frequencies at each locus, and then adjust for multiple testing (Sasieni, 1997). Single-marker analyses are likely to be most powerful for Mendelian diseases, where only a single marker is strongly associated with disease. However, for complex human diseases, such approaches suffer from weak power, because of the relatively small genetic effects and the need to correct for multiple testing (Schaid, 2004; Schaid et al., 2005). An alternative is therefore to test the effects of multiple markers jointly. For example, we can fit a logistic regression model, and then simultaneously test the main effects and possibly the interactions of multiple markers (Fan & Knapp, 2003). Although these methods can be more powerful than single-marker methods (Longmate, 2001), when there are many markers they are also subject to weak power because of the high degrees of freedom (d.f.; Fallin & Schork, 2000; Schaid, 2004).

Furthermore, when modelling the genes and complex interrelationships amongst genes simultaneously, parametric approaches would encounter too many parameters, possibly causing multicollinearity and model instability (Lohmueller et al., 2003; Neale & Sham, 2004; Schaid et al., 2005). In addition, Bayesian methods are also used in the literature (Morris et al., 2000; Liu et al., 2001; Conti et al., 2003), but it is difficult to evaluate whether the complex model is overfitted (Schaid et al., 2005).

To overcome these limitations and improve the power over that of traditional methods, we developed a nonparametric approach that tests the joint association of multiple loci with only a single degree of freedom. The proposed method is based on U-statistics, which are asymptotically normally distributed. We expect this approach to incorporate haplotype structure information, which is believed to provide high resolution for detecting associations between genetic variants and diseases. With this in mind, we introduce a new kernel function to measure the similarity of pairs of haplotypes, called EGS (defined later), which is estimated entirely from the data and requires no additional ad hoc assumptions.

To demonstrate the validity and statistical properties of the proposed approach, we perform a series of simulation studies. The results suggest that our approach is more powerful and flexible in identifying associations between diseases and haplotypes, possibly because of the extra information carried by the new EGS kernel. Application of our approach to a study of candidate genes for internalising disorder illustrates its utility and interpretability, and also yields interesting findings.

Materials and Methods

Basic Statement and Description

Consider a sample of n1 cases and n2 controls, at L SNP markers. Let Hi be a random vector, which denoted the ith sampled haplotype. Let inline image for a case haplotype, and inline image for a control haplotype (i = 1, 2,…, 2n1; j = 1, 2,…, 2n2; h = 1, 2,…, m, where m was the number of distinct haplotypes in the data, including all cases and controls), and inline image.

A Pearson's χ2 test with m − 1 d.f. provided one possible way to determine whether the two haplotype distributions differed, that is, H0 : inline image versus H1 : not H0 (where “∼” denotes “identically distributed”). However, as the number of SNPs increased, this approach suffered from weak power owing to the large number of d.f. Alternatively, statistics based on pairwise comparisons between haplotypes from cases and controls used only 1 d.f.; amongst these, the U-statistic was one of the general forms. Let inline image, where Ud and Uc were the U-statistics for case and control haplotypes, respectively. Thus, the problem above was equivalent to testing H0: inline image versus H1: inline image.
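To make the degrees-of-freedom argument concrete, the following minimal sketch computes the Pearson χ2 statistic for a 2 × m table of haplotype counts. The function name `pearson_chi2` and the example counts are hypothetical and serve only as an illustration; the point is that the d.f. grow with the number m of distinct haplotypes.

```python
def pearson_chi2(case_counts, control_counts):
    """Pearson chi-square statistic for a 2 x m table of haplotype counts.

    Returns (statistic, degrees_of_freedom). Haplotypes that are absent
    from both groups are ignored, so m counts only observed haplotypes.
    """
    if len(case_counts) != len(control_counts):
        raise ValueError("both groups need one count per haplotype")
    columns = [(a, b) for a, b in zip(case_counts, control_counts) if a + b > 0]
    n_case = sum(a for a, _ in columns)
    n_control = sum(b for _, b in columns)
    total = n_case + n_control
    statistic = 0.0
    for a, b in columns:
        column_total = a + b
        for observed, row_total in ((a, n_case), (b, n_control)):
            expected = row_total * column_total / total
            statistic += (observed - expected) ** 2 / expected
    return statistic, len(columns) - 1


# Hypothetical counts: two haplotypes give 1 d.f., five haplotypes give 4 d.f.
stat_small, df_small = pearson_chi2([30, 10], [20, 20])
stat_large, df_large = pearson_chi2([12, 8, 8, 6, 6], [8, 8, 8, 8, 8])
```

With L diallelic markers there can be up to 2^L distinct haplotypes, so m − 1 can become large quickly, whereas a statistic built from pairwise comparisons keeps a single d.f.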

U-Statistic for Within-Group Haplotypes

First, we introduce some notation. Let inline image, where inline image was a random variable representing the lth locus of the ith sampled haplotype (i = 1, 2,…, n; l = 1, 2,…, L; here n was the number of haplotypes within one group; in other words, n = 2n1 for cases and n = 2n2 for controls). Let K(Hi, Hj) be a symmetric kernel function that compared the ith and jth haplotypes. Hence, the general form of a U-statistic for within-group haplotypes could be written as:

  • display math(1)

where inline image was the kernel function for locus l, and inline image represented the U-statistic for locus l. The formula shows that the test statistic decomposes into a sum of per-locus U-statistics, so many markers can be analysed regardless of how many distinct haplotypes there are. In particular, a different kernel function could be used for each locus, that is, each locus was treated differently. With an appropriate kernel function, loci that were more likely to be associated with disease were expected to contribute more to the U-statistic. The key step in using this statistic was therefore the design of a proper kernel function.
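The decomposition described above can be sketched in a few lines. This is an illustration under an assumed form, namely that the within-group statistic is the average over all unordered haplotype pairs of a kernel that is itself a sum of per-locus kernels; the function names and the toy haplotypes are hypothetical.

```python
from itertools import combinations


def u_statistic(haplotypes, locus_kernels):
    """Average over all pairs i < j of sum_l K_l(x_il, x_jl) (assumed form)."""
    pairs = list(combinations(haplotypes, 2))
    if not pairs:
        raise ValueError("need at least two haplotypes")
    total = 0.0
    for hi, hj in pairs:
        total += sum(k(a, b) for k, a, b in zip(locus_kernels, hi, hj))
    return total / len(pairs)


def per_locus_u(haplotypes, locus_kernels):
    """U-statistic of each locus taken separately, using its own kernel."""
    pairs = list(combinations(haplotypes, 2))
    if not pairs:
        raise ValueError("need at least two haplotypes")
    return [
        sum(kernel(hi[l], hj[l]) for hi, hj in pairs) / len(pairs)
        for l, kernel in enumerate(locus_kernels)
    ]


def match(a, b):
    """Simple per-locus kernel: 1 if the two alleles agree, else 0."""
    return 1.0 if a == b else 0.0


# Toy group of three haplotypes at three diallelic loci (hypothetical data).
group = [(0, 1, 1), (0, 0, 1), (1, 1, 1)]
kernels = [match, match, match]
u_total = u_statistic(group, kernels)
u_by_locus = per_locus_u(group, kernels)
```

Because the per-locus kernels are passed as a list, a different kernel can be supplied for each locus, which is the property the EGS kernel relies on.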

Choice of Kernels

Before we introduced our new kernel function, we first reviewed some commonly used kernel functions.

By defining K(Hi, Hj) to be 1 if the haplotypes matched and 0 otherwise, we obtained the “matching measure.” The “length measure” could be obtained by defining K(Hi, Hj) to be the length spanned by the longest continuous interval of matching alleles. These two measures were not robust to genotyping errors, missing data and recent mutations (Tzeng et al., 2003). In addition, neither measure could treat different loci differently. Finally, one commonly used kernel function called “allele match” (also known as the “counting measure”), denoted “AM” for brevity, was defined as the number of alleles in common between the two haplotypes. It was robust to genotyping errors, missing data and recent mutations. Furthermore, AM could be thought of as a compromise between the matching and the length measures.
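The three measures reviewed above can be written down directly. The sketch below is illustrative only: haplotypes are assumed to be equal-length sequences of alleles, and the length measure is counted in number of loci, which is an assumption (the span could equally be measured in physical distance).

```python
def matching_measure(h1, h2):
    """1 if the two haplotypes are identical at every locus, else 0."""
    return 1 if tuple(h1) == tuple(h2) else 0


def length_measure(h1, h2):
    """Longest run of consecutive matching loci (counted in loci here)."""
    best = run = 0
    for a, b in zip(h1, h2):
        run = run + 1 if a == b else 0
        best = max(best, run)
    return best


def allele_match(h1, h2):
    """AM / counting measure: number of loci at which the alleles agree."""
    return sum(1 for a, b in zip(h1, h2) if a == b)


# Two hypothetical haplotypes that differ only at the third locus.
h_a = (0, 1, 1, 0, 1)
h_b = (0, 1, 0, 0, 1)
scores = (matching_measure(h_a, h_b), length_measure(h_a, h_b), allele_match(h_a, h_b))
```

A single discordant allele drops the matching measure to 0 and halves the length measure in this example, while AM changes by only one unit, which illustrates why AM is the more robust of the three.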

Many other kernel functions were also frequently used in the literature, but we did not list them all here. From an evolutionary viewpoint, the case haplotypes had a greater tendency to cluster together on the genealogy than did the control haplotypes. That was because the time taken for case haplotypes to trace back to their common ancestor (carrying the disease mutations) was much shorter than that for the whole population to trace back to its common ancestor (Zöllner & Pritchard, 2005). Therefore, case haplotypes might show excess matching, or unusual haplotype sharing. On these grounds, the AM kernel seemed the better choice of the three. However, the AM function implicitly assumed that the loci were independent of each other, and did not distinguish between loci. In order to incorporate the locus information, we introduced the following new kernel function:

The new kernel: EGS

Although we were using a nonparametric method, we took inspiration from the parametric entropy-guided distance (Jin et al., 2010), and generalised the AM kernel, so that the new kernel could incorporate the information of the locus itself and the structure information amongst loci. The new kernel function was defined as

  • display math(2)

where inline image (i, j = 1, 2,…, n; l = 1, 2,…, L; xil = 0, 1), and inline image represented all the loci before inline image (by default, the natural order of the loci was taken as the ordering); M was a specified model for the loci under investigation, such as the independence model or a first-order Markov chain; and inline image was the maximum likelihood estimate (MLE) of inline image. This kernel could also be thought of as a measure guided by entropy, and was denoted “EGS” (Shrinkage of Entropy-Guided distance) for brevity.

This new EGS kernel possessed an important statistical feature: the statistic δ based on EGS could be viewed as a form of weighted log odds. Taking a single locus as an example, assume the allele frequencies for cases and controls were as follows:

            Case    Control
Allele 0    π11     π12
Allele 1    π21     π22

Let Kd and Kc be the kernel functions for cases and controls, respectively, where inline image and inline image (a form of |log(odds)|; the absolute value was used to prevent log(odds) of opposite sign from cancelling when summing across loci); inline image, so if Kd and Kc had the same sign, inline image. (In practice, Kd and Kc generally had the same sign, because the frequencies of the risk allele in cases and controls were usually below 50%, owing to the rarity of the mutation.) Therefore, for a single locus, inline image, inline image, where ω1 and ω2 were functions of πij, inline image. So the statistic inline image could be expressed as weighted log odds.

Global Test Statistic and Its Distribution

As we stated, to test whether the haplotype distributions between cases and controls differed, we applied the U-statistic to construct hypotheses H0: inline image versus H1: inline image, where inline image. Hence, the global test statistic was

  • display math(3)

Here inline image and inline image were the variances of the U-statistics for cases and controls, respectively, and Zglobal had an asymptotic standard normal distribution under the null hypothesis. The asymptotic variances of Ud and Uc could be determined using standard results on U-statistics (Hoeffding, 1948; Mao et al., 1998).
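As an illustration of how such a statistic is used, the sketch below standardises the difference between the two U-statistics and converts it to a two-sided P-value from the standard normal distribution. The standardised-difference form Z = (Ud − Uc) / sqrt(Var(Ud) + Var(Uc)) is an assumption inferred from the description above, and the function names and numerical inputs are hypothetical.

```python
import math


def global_z(u_case, u_control, var_case, var_control):
    """Standardised difference of two independent U-statistics (assumed form)."""
    pooled = var_case + var_control
    if pooled <= 0:
        raise ValueError("the summed variance must be positive")
    return (u_case - u_control) / math.sqrt(pooled)


def two_sided_p_value(z):
    """P(|N(0, 1)| >= |z|), computed with the complementary error function."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical inputs: U_d = 1.5, U_c = 1.0, variances 0.03 and 0.01.
z = global_z(1.5, 1.0, 0.03, 0.01)
p = two_sided_p_value(z)
```

Whatever the number of loci, the test compares a single number with the standard normal distribution, which is what is meant by a test with one degree of freedom.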

The statistic inline image, where Ul was the U-statistic for locus l, as introduced above. The variance inline image (the calculation for inline image was the same) under the independent-locus model could then be written as

  • display math(4)

where inline image, inline image.

Let inline image denote the MLE of the probability inline image, and let inline image denote the MLE of the joint probability inline image (inline image = 1, 2, 3, 4 corresponding to (xs, xt) = (0, 0), (0, 1), (1, 0), (1, 1)). Then, the asymptotic variances could be expressed as

  • display math(5)
  • display math

For more details, see Appendix II. The Supporting Information also shows how to calculate this variance under a first-order Markov chain when EGS is used as the kernel for the U-statistic. The calculation generalises in the same way to other locus models.

Results

Simulation Design

We designed a detailed simulation study to evaluate the performance of our proposed method, denoted “U-EGS” (U-statistic based on EGS kernel) for brevity, by comparing it with the traditional U-statistic based on AM kernel (denoted as “U-AM”). To provide a comprehensive study and demonstrate the validity of the proposed approach, we performed extensive simulations in a variety of settings with different parameters as well.

We generated case-control samples using a real data set as input, so that the simulated data had LD patterns similar to those of the real data set. The data set comprised 25 distinct haplotypes at 11 diallelic markers, and the frequencies of the haplotypes in the population were known (Yeager et al., 2007). The procedure used to generate the simulated data was as follows:

  1. Assign the parameters of the model: the prevalence Kp = P(Y = 1), the risk allele, the genotype relative risk RR, and the sample sizes for cases and controls.
  2. Calculate the genotype frequencies and the corresponding penetrances at the risk locus (assuming Hardy-Weinberg equilibrium), according to the real data set and the parameters assigned in step 1.
  3. Randomly sample two haplotypes from the haplotype pool of the real data set (assuming that the samples follow a multinomial distribution with the haplotype frequencies as parameters), and combine the two selected haplotypes into an individual.
  4. Determine the phenotype of the individual from step 3 (case or control, assuming that the phenotype follows a Bernoulli, that is, 0–1, distribution), according to its genotype at the risk locus and the penetrances from step 2.
  5. Repeat steps 3 and 4 until the numbers of cases and controls assigned in step 1 are reached.
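The five steps above can be sketched as follows. This is a minimal illustration, not the code used in the study: the function names, the toy haplotype pool and all parameter values are hypothetical, and the penetrances are obtained from the prevalence and relative risks under Hardy-Weinberg equilibrium as Kp = f0·p² + 2·f1·p·q + f2·q².

```python
import random


def penetrances(kp, q, rr1, rr2):
    """Penetrances (f0, f1, f2) for 0, 1, 2 risk alleles under HWE (step 2)."""
    p = 1.0 - q
    f0 = kp / (p * p + 2.0 * p * q * rr1 + q * q * rr2)
    f = (f0, rr1 * f0, rr2 * f0)
    if max(f) > 1.0:
        raise ValueError("parameters imply a penetrance above 1")
    return f


def simulate_case_control(haplotypes, freqs, risk_locus, risk_allele,
                          kp, rr1, rr2, n_cases, n_controls, rng):
    """Draw individuals until the requested numbers of cases and controls exist."""
    # Frequency of the risk allele implied by the haplotype frequencies.
    q = sum(f for h, f in zip(haplotypes, freqs) if h[risk_locus] == risk_allele)
    pen = penetrances(kp, q, rr1, rr2)
    cases, controls = [], []
    while len(cases) < n_cases or len(controls) < n_controls:
        # Step 3: two haplotypes drawn with the population frequencies.
        h1, h2 = rng.choices(haplotypes, weights=freqs, k=2)
        copies = int(h1[risk_locus] == risk_allele) + int(h2[risk_locus] == risk_allele)
        # Step 4: Bernoulli phenotype with the genotype-specific penetrance.
        affected = rng.random() < pen[copies]
        if affected and len(cases) < n_cases:
            cases.append((h1, h2))
        elif not affected and len(controls) < n_controls:
            controls.append((h1, h2))
    return cases, controls


# Toy pool of four two-locus haplotypes (hypothetical frequencies).
pool = [(0, 0), (0, 1), (1, 0), (1, 1)]
pool_freqs = [0.4, 0.3, 0.2, 0.1]
cases, controls = simulate_case_control(
    pool, pool_freqs, risk_locus=1, risk_allele=1,
    kp=0.1, rr1=1.5, rr2=2.25, n_cases=20, n_controls=20,
    rng=random.Random(0),
)
```

With a realistic prevalence such as Kp = 0.002 most sampled individuals are controls, so many draws are discarded before the required number of cases is reached; the toy prevalence of 0.1 is used here only to keep the example fast.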

We applied the two methods, U-EGS and U-AM, under 360 different settings, comparing their performance and evaluating how the parameters influenced them. For each combination of disease model and parameters, we generated simulated data with sample sizes of 100, 200, 400, 1000 and 2000, each containing equal numbers of cases and controls. The risk allele that we chose was the minor allele at locus 6. Four disease models were considered: the multiplicative model (RR1 = RR, RR2 = RR^2), the additive model (RR1 = RR, RR2 = 2RR), the dominant model (RR1 = RR2 = RR) and the recessive model (RR1 = 1, RR2 = RR). The parameters were: prevalence Kp = 0.02, 0.002, 0.0002, and genotype relative risk RR = 1.2, 1.3, 1.4, 1.5, 1.6, 1.7 (Kp and RR could not be assigned freely; the constraints are given in Appendix I). Thus, there were 5 (sample sizes) × 4 (disease models) × 3 (prevalences) × 6 (genotype relative risks) = 360 different settings in all. We also compared U-EGS with other existing methods.
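The simulation grid described above can be enumerated directly. The sketch below encodes the four disease models exactly as they are defined in the text and confirms the count of 360 settings; the function and variable names are hypothetical.

```python
from itertools import product


def genotype_relative_risks(model, rr):
    """Return (RR1, RR2) for the disease models as defined in the text."""
    if model == "multiplicative":
        return rr, rr ** 2
    if model == "additive":
        return rr, 2 * rr
    if model == "dominant":
        return rr, rr
    if model == "recessive":
        return 1.0, rr
    raise ValueError(f"unknown disease model: {model}")


sample_sizes = [100, 200, 400, 1000, 2000]
models = ["multiplicative", "additive", "dominant", "recessive"]
prevalences = [0.02, 0.002, 0.0002]
relative_risks = [1.2, 1.3, 1.4, 1.5, 1.6, 1.7]

# One tuple (sample size, model, Kp, RR) per simulation setting.
settings = list(product(sample_sizes, models, prevalences, relative_risks))
n_settings = len(settings)
```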

Comparing U-EGS with U-AM and Other Methods

We performed detailed studies of the empirical power (at the 5% nominal significance level unless otherwise stated) and the corresponding type-I error rates of the two methods, U-EGS and U-AM. Because the AM kernel function treated the loci as independent by default, we also used an independence model for EGS, so that the comparison was fair.

First, the prevalence Kp had little effect on the power of the two methods (Table 1). Theoretically, for fixed genotype frequencies, RR and disease model, a larger prevalence meant larger penetrances, and hence a larger proportion of affected individuals carrying the risk allele. The results were broadly consistent with this: in most settings the power with Kp = 0.02 was slightly higher than that with Kp = 0.002 or Kp = 0.0002 (the latter two were similar, with no clear difference between them). Furthermore, under the recessive model the power varied less across prevalences than under the other three models.

Table 1. Power to detect disease–marker associations under different models and prevalences (Kp). The sample size is 400.

                Multiplicative    Additive          Dominant          Recessive
Kp      RR      U-AM    U-EGS     U-AM    U-EGS     U-AM    U-EGS     U-AM    U-EGS
0.02    1.2     0.130   0.127     0.251   0.296     0.104   0.103     0.054   0.046
        1.3     0.224   0.264     0.366   0.432     0.152   0.185     0.061   0.053
        1.4     0.340   0.391     0.478   0.597     0.243   0.294     0.057   0.058
        1.5     0.472   0.565     0.580   0.690     0.357   0.418     0.068   0.059
        1.6     0.609   0.701     0.660   0.788     0.422   0.529     0.074   0.074
        1.7     0.707   0.831     0.768   0.860     0.538   0.649     0.081   0.077
0.002   1.2     0.125   0.145     0.257   0.326     0.112   0.111     0.063   0.055
        1.3     0.180   0.262     0.354   0.445     0.172   0.196     0.064   0.054
        1.4     0.319   0.396     0.449   0.578     0.211   0.271     0.050   0.054
        1.5     0.430   0.531     0.548   0.688     0.314   0.389     0.065   0.081
        1.6     0.543   0.661     0.654   0.801     0.398   0.502     0.065   0.074
        1.7     0.681   0.802     0.769   0.871     0.483   0.622     0.083   0.067
0.0002  1.2     0.122   0.138     0.225   0.288     0.085   0.113     0.052   0.058
        1.3     0.217   0.232     0.343   0.429     0.118   0.154     0.045   0.050
        1.4     0.299   0.398     0.468   0.576     0.232   0.289     0.048   0.064
        1.5     0.432   0.539     0.534   0.669     0.294   0.362     0.071   0.070
        1.6     0.541   0.703     0.665   0.803     0.404   0.501     0.070   0.068
        1.7     0.665   0.802     0.724   0.874     0.489   0.616     0.071   0.085

Next we investigated the genotype relative risk, RR. Similarly, for fixed genotype frequencies, Kp and disease model, a larger RR also meant larger penetrances. As shown in Table 1, RR had a major influence on the results. For a fixed disease model, the empirical power of both methods increased with increasing RR, and the largest power was attained at the greatest RR. The trends for U-EGS and U-AM were similar, but the power of U-EGS was greater than that of U-AM under the same parameters in the overwhelming majority of situations. The corresponding plots for a sample size of 400 are given in the Supporting Information (Figs S1–S3).

The disease model was another important factor with a marked impact on performance. The trends for each disease model were alike under the same Kp and RR, and the lowest power was attained for the recessive model. Moreover, for the multiplicative, additive and dominant models, U-EGS almost always had greater power than U-AM. For the recessive model, both methods had nearly the same low power. This excess power may be driven by the extra locus information incorporated by the EGS kernel. In addition, Tables S1–S4 show the results for the other sample sizes, for which the influence of these factors was similar.

Finally, we compared the performance of the two methods at different sample sizes. Figures 1–4 show the empirical power of both methods as a function of sample size for the four disease models, respectively. The power of both methods decreased as the sample size dropped, but the power of U-AM declined much faster than that of U-EGS for the multiplicative, additive and dominant models. For the recessive model, both methods had nearly the same low power. This indicated that U-EGS outperformed U-AM in practice, because the sample size is usually limited in real data as a result of the high cost. Our proposed method U-EGS was therefore clearly the more useful in practice. It was worth noting that the true model followed by the loci might be neither independent nor Markovian, but the EGS kernel, which combined information amongst loci (even if that information was only partially or approximately true), could still enhance the power to identify disease–marker associations.

Figure 1. Power for detecting disease-marker associations using U-AM and U-EGS, as a function of sample size under multiplicative model, when prevalence is 0.002.

Figure 2. Power to identify disease-marker associations using U-AM and U-EGS, as a function of sample size under additive model, when prevalence is 0.002.

Figure 3. Power to detect disease-marker associations using U-AM and U-EGS, as a function of sample size under dominant model, when prevalence is 0.002.

Figure 4. Power for identifying disease-marker associations using U-AM and U-EGS, as a function of sample size under recessive model, when prevalence is 0.002.

Table 2 shows the type-I error rates for U-EGS under different parameters and nominal significance levels. U-EGS was robust: almost all of the type-I error rates were close to the nominal significance levels. Because genetic/genomic studies typically test multiple genes, we also investigated several lower α levels, which led to similar conclusions.

Table 2. Type-I error rates using U-EGS under different prevalences and sample sizes (α is the nominal significance level)

                        Type-I error rates
Kp      Sample size     α = 0.05    α = 0.01    α = 0.005
0.02    100             0.042       0.011       0.0063
        200             0.054       0.011       0.0060
        400             0.034       0.011       0.0063
        1000            0.042       0.003       0.0028
        2000            0.035       0.006       0.0019
0.002   100             0.042       0.010       0.0049
        200             0.039       0.011       0.0055
        400             0.052       0.010       0.0050
        1000            0.032       0.007       0.0046
        2000            0.030       0.004       0.0033
0.0002  100             0.039       0.008       0.0045
        200             0.045       0.008       0.0062
        400             0.030       0.009       0.0058
        1000            0.047       0.005       0.0026
        2000            0.037       0.003       0.0028

Finally, we also compared U-EGS with the Pearson χ2 test, the SNP-set Kernel Association Test (SKAT) proposed by Wu et al. (2010, 2011) and the nonparametric method proposed by Sha et al. (2007), using the simulated data sets described above. Figure 5 shows the power to detect disease–haplotype associations when the sample size was 400 and the prevalence Kp was 0.02. Table S5 lists the exact values plotted in Figure 5, and Table S6 lists the power when the sample size was 100, which gave similar results. U-EGS outperformed the other methods in almost all settings. A possible reason is the use of the new EGS kernel, which proved more powerful than AM in the comparisons above.

Figure 5. Power for detecting disease-marker associations using different methods (sample size is 400, prevalence Kp = 0.02). Sm, Sl and Sc denote the method of Sha et al. (2007) using the match, length and counting measures, respectively; X2 denotes the Pearson chi-square test; SKAT1 and SKAT2 denote the method proposed by Wu et al. (2010) using the linear kernel and the linear-weighted kernel (without covariates), respectively.

Application to Internalising Disorder Data

Internalising disorder is a group of psychiatric diseases with anxiety and fear as the major clinical symptoms, including depression, generalised anxiety disorder, panic disorder, obsessive-compulsive disorder, specific phobia, agoraphobia and social phobia. It is also one of the most common psychological problems amongst undergraduates.

This study used saliva samples from a comprehensive university undergraduates’ mental health survey in Jilin Province in 2007. We extracted genomic DNA from 228 cases and 206 controls (without missing data), and detected polymorphisms of 10 tagSNPs in the BDNF, MAOA and SLC6A4 genes using the polymerase chain reaction-ligase detection reaction method. The BDNF gene was localised to the boundary of Chromosome 11p13 and 11p14 (Hanson et al., 1992), and its selected tagSNPs were rs2030324, rs12273539 and rs10835210; the MAOA gene was located at Chromosome Xp11.3, and the selected tagSNP was rs2283724; the SLC6A4 gene was located at Chromosome 17q11.1–11.2, and its selected tagSNPs were rs6354, rs3794808, rs11080122, rs2020942, rs2020939 and rs12449783 (Table 3 shows the P-values of the single-SNP tests, each based on a 2 × 2 contingency table). Furthermore, the selected loci could represent 100% of the genetic information of all tagSNPs in the corresponding genes (Meng et al., 2009).

Table 3. P-values for 10 SNPs in internalising disorder data, under single SNP test

Genes     SNPs          P-values
BDNF      rs12273539    0.221393
          rs10835210    0.000133
          rs2030324     0.048357
MAOA      rs2283724     0.016402
SLC6A4    rs12449783    0.905811
          rs3794808     0.788047
          rs2020942     0.530072
          rs11080122    0.467264
          rs6354        0.215769
          rs2020939     0.137946

Table 4 lists the P-values yielded by U-AM and U-EGS when applied to the aforementioned data. As shown in the table, U-AM detected no significant association between any of the three genes and internalising disorder after Bonferroni correction (P > 0.05/3). However, U-EGS provided strong evidence that the BDNF (P = 0.003634) and MAOA (P < 0.000001) genes were associated with internalising disorder, and no evidence of association for the SLC6A4 gene (P = 0.966670).

Table 4. P-values for three genes in internalising disorder data, using U-EGS and U-AM, respectively

          Methods
Genes     U-AM        U-EGS
BDNF      0.034296    0.003634
MAOA      0.021765    <0.000001
SLC6A4    0.609912    0.966670
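The Bonferroni comparison behind these conclusions can be reproduced from the values in Table 4. In the sketch below, the entry reported as <0.000001 is represented by its upper bound 0.000001, which is a stand-in and not the exact value; the variable names are hypothetical.

```python
# P-values from Table 4; the MAOA U-EGS value is reported only as < 0.000001,
# so its upper bound is used here as a conservative stand-in.
p_values = {
    "BDNF": {"U-AM": 0.034296, "U-EGS": 0.003634},
    "MAOA": {"U-AM": 0.021765, "U-EGS": 0.000001},
    "SLC6A4": {"U-AM": 0.609912, "U-EGS": 0.966670},
}

alpha = 0.05
threshold = alpha / len(p_values)  # Bonferroni correction for three genes

significant = {
    method: [gene for gene, p in p_values.items() if p[method] < threshold]
    for method in ("U-AM", "U-EGS")
}
```

Both U-AM P-values for BDNF and MAOA fall below 0.05 but above the corrected threshold of about 0.0167, which is why they are not declared significant.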

The protein encoded by BDNF is a member of the nerve growth factor family. It is induced by cortical neurons, and is necessary for the survival of striatal neurons in the brain. The protein encoded by the monoamine oxidase A gene (MAOA) is an enzyme that degrades amine neurotransmitters, such as dopamine, norepinephrine and serotonin. Associations between the BDNF or MAOA genes and mental disorders have been widely reported in the literature, and considerable evidence suggests that both genes play a role in the occurrence of mental disorders (Pandey et al., 2008; Doornbos et al., 2009; Real et al., 2009; Dwivedi, 2010; Fan et al., 2010; Zhang et al., 2010). In this study, we found that the BDNF and MAOA genes were associated with internalising disorder using U-EGS, which was consistent with the results in the literature.

The content and dysfunction of 5-HT in the central nervous system may be associated with the occurrence of mental disorders. SLC6A4 is the transporter gene of 5-HT, and has a direct effect on the content of 5-HT. Extensive research suggests that the SLC6A4 gene may play some part in the occurrence of mental disorders, but that this effect may be weak (Wendland et al., 2007; Bloch et al., 2008; Gressier et al., 2009; McEachin et al., 2010). In our study, SLC6A4 showed no association with internalising disorder (P = 0.966670) using U-EGS, which was also consistent with the results in the literature.

Discussion

Haplotype-based association studies will probably continue to play an important role in the study of complex human diseases (Schaid, 2004; Tzeng et al., 2006). Amongst these various methods, nonparametric methods are not constrained by the problems caused by modelling, such as overfitting, multicollinearity and model instability due to the presence of too many parameters (Schaid et al., 2005). In this article, we put forward a novel nonparametric approach based on the U-statistic, U-EGS, to detect disease–haplotype associations in case-control studies. We develop a new kernel for the U-statistic, EGS, to measure the similarities between pairs of haplotypes, which can incorporate haplotype structure information.

Our approach gains power in identifying associations between haplotypes and diseases. We have demonstrated the utility of our method via a detailed simulation study, and the results indicate that the EGS kernel captures extra haplotype information, which we believe improves the power to detect disease–haplotype associations. In particular, the simulation studies suggest that the proposed method, U-EGS, should be preferred when the sample size is small.

In this research, our proposed approach is aimed at the statistical analysis of case-control studies, so the U-EGS method cannot be applied directly when the disease trait has more than two levels or when the population is stratified. Under these circumstances, we can try to model haplotype similarities (EGS) between individuals within the same level or stratum. For example, we may consider a one-way ANOVA (Mukhopadhyay et al., 2010),

  • display math(6)

where Kg(Hi, Hj) is the kernel function (EGS) between haplotype Hi and Hj (i < j = 1, 2,…, 2ng) within the level g (g = 1, 2,…, G), α is the overall grand mean or general effect for pairs of individuals, βg is the group-specific treatment effect for similarity scores over the general effect and inline image are the error components. Thus, the associations of haplotypes with disease traits now can be detected by testing inline image versus H1 : not H0.

Further, our method U-EGS outperforms U-AM in the internalising disorder study. Using U-AM, no association is detected between internalising disorder and any of the three genes under investigation. However, using U-EGS, we have discovered that the BDNF and MAOA genes are associated with internalising disorder, while no association is observed for the SLC6A4 gene. The improved ability to identify haplotype–disease associations may be driven by the information that the EGS kernel incorporates from the haplotypes.

Moreover, our proposed approach is aimed at the statistical analysis of haplotype data in case-control studies. In simulation studies, we generated haplotype samples directly, and for the internalising disorder data, we used the EM algorithm to infer haplotypes from genotypes. Undoubtedly, using these estimated haplotypes may introduce additional measurement error, which may result in biased conclusions (Sha et al., 2005; Zeng & Lin, 2005). Therefore, accounting for the genotype data directly in our method is worthy of further research.

Finally, complex human diseases are believed to be influenced by genetic and environmental factors, and/or their interactions. However, neither our proposed approach U-EGS nor U-AM can incorporate environmental or other covariates, or their interactions with genetic factors. Thus, the question of how to investigate genetic and environmental factors and/or their interactions jointly and correctly (Niu et al., 2011) is an issue for our future research. Existing work on incorporating covariates in nonparametric methods may offer guidance (Kwee et al., 2008; Wu et al., 2010, 2011, 2013; Zhu et al., 2012).

In summary, U-EGS can test the simultaneous effects of multiple markers, and offers a clear advantage, especially for small sample sizes. By combining the structure information of haplotypes, our approach is also efficient in identifying disease–haplotype associations, and it is fast to perform.

Acknowledgement

This work was supported by the National Natural Science Foundation of China (no. 11025102, 11001044, 11101182, 11371083 and 11301213); the Fundamental Research Funds for the Central Universities (no. 11CXPY007, 10JCXK001); Natural Science Foundation of Jilin Province (no. 201215007); a Jilin Project (no. 20100401); Scientific Research Foundation of Returned Scholars, MOE of China; Program for Changjiang Scholars and Innovative Research Team in University.

References

Appendix I

Constraints amongst Parameters When Producing the Simulated Data Sets

Taking a single locus as an example, we assumed that allele “A” was the risk allele, with allele frequency q, and that allele “a” was the major allele, with allele frequency p = 1−q. We also assumed that the population was in Hardy-Weinberg equilibrium. Let fi (i = 0, 1, 2) represent the penetrance of a genotype carrying i copies of the risk allele (fi ∈ [0, 1]), and let Kp = P(Y = 1) denote the prevalence. Let RR1 = f1/f0 and RR2 = f2/f0 be the first and second genotype relative risks (i.e. the heterozygous and homozygous GRRs, with RR2 ≥ RR1 ≥ 1), respectively.

  • math image

Therefore,

  • math image

Because RR2 ≥ RR1 ≥ 1, these penetrances satisfied f2 ≥ f1 ≥ f0 ≥ 0, so to guarantee 1 ≥ f2 ≥ f1 ≥ f0 ≥ 0 we only needed to confirm that f2 ≤ 1. We now discuss the constraints amongst the parameters under each model in turn.
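
As a minimal sketch of the computation described above, the following Python function recovers the three penetrances from the prevalence, the risk-allele frequency and the two genotype relative risks. It relies only on the Hardy-Weinberg genotype frequencies and the law of total probability, Kp = f0·p² + f1·2pq + f2·q². The function names `penetrances` and `is_feasible` and the example parameter values are illustrative assumptions, not taken from the simulation settings.

```python
def penetrances(kp, q, rr1, rr2):
    """Return (f0, f1, f2) for prevalence kp, risk-allele frequency q and GRRs rr1, rr2.

    Hypothetical helper: assumes Hardy-Weinberg equilibrium, so genotypes with
    0, 1 and 2 copies of the risk allele have frequencies p^2, 2pq and q^2.
    """
    p = 1.0 - q
    # Kp = f0 * (p^2 + 2pq*RR1 + q^2*RR2), solved for the baseline penetrance f0.
    f0 = kp / (p * p + 2.0 * p * q * rr1 + q * q * rr2)
    f1 = rr1 * f0
    f2 = rr2 * f0
    return f0, f1, f2


def is_feasible(kp, q, rr1, rr2):
    """With RR2 >= RR1 >= 1, the parameter set is valid exactly when f2 <= 1."""
    return penetrances(kp, q, rr1, rr2)[2] <= 1.0


# Illustrative values only (not the settings used in the paper).
f0, f1, f2 = penetrances(kp=0.02, q=0.2, rr1=1.5, rr2=2.25)
```

A parameter combination for which `is_feasible` returns `False` would imply a penetrance above 1 and therefore cannot be used to generate data.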

Multiplicative Model

The multiplicative model required that RR2 = RR1².

  • display math

that is, the constraint amongst parameters was: inline image.

Additive Model

Additive model required that RR2 = 2RR1.

  • display math

hence, the constraint amongst parameters was: inline image.

Dominant Model

Dominant model required that RR1 = RR2.

  • display math

so the constraint amongst parameters was: inline image.

Recessive Model

Recessive model required that RR1 = 1.

  • display math

therefore, the constraint amongst parameters was: inline image.
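
The four cases can be checked numerically with a short sketch. It assumes Hardy-Weinberg equilibrium and uses the fact that f2 ≤ 1 is equivalent to Kp ≤ (p² + 2pq·RR1 + q²·RR2)/RR2, which follows from Kp = f0(p² + 2pq·RR1 + q²·RR2) and f2 = RR2·f0. The single effect-size parameter `r`, the function names and the model labels are illustrative assumptions; the additive case uses RR2 = 2RR1 as stated in the text.

```python
def relative_risks(model, r):
    """Map a single effect size r >= 1 to (RR1, RR2) under the named model."""
    if model == "multiplicative":
        return r, r * r          # RR2 = RR1^2
    if model == "additive":
        return r, 2.0 * r        # RR2 = 2*RR1, as stated in the text
    if model == "dominant":
        return r, r              # RR1 = RR2
    if model == "recessive":
        return 1.0, r            # RR1 = 1
    raise ValueError(f"unknown model: {model}")


def max_prevalence(model, r, q):
    """Largest prevalence Kp for which the homozygous penetrance f2 stays <= 1."""
    p = 1.0 - q
    rr1, rr2 = relative_risks(model, r)
    return (p * p + 2.0 * p * q * rr1 + q * q * rr2) / rr2


# Illustrative values only: effect size 2 and risk-allele frequency 0.2.
bounds = {m: max_prevalence(m, 2.0, 0.2)
          for m in ("multiplicative", "additive", "dominant", "recessive")}
```

Any simulated prevalence at or below the corresponding entry of `bounds` keeps all three penetrances within [0, 1] for that model.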

Appendix II

Variance of the U-EGS under the Assumption that All the Loci Are Independent

The statistic was inline image, where Ul was the U-statistic for locus l. Our aim was to calculate the variance of the statistic U, that is, inline image where inline image. Therefore, it was necessary to calculate the variance of the U-statistic for each locus and the covariance of the U-statistics between each pair of loci. The calculation was the same for cases and controls, so here we omitted the superscript “c” or “d” from all notation.

To determine the variances and the covariances, we used standard results on U-statistics (see more details in Hoeffding, 1948; Mao et al., 1998). We first review the notation. Consider a single locus, say locus l, whose allele frequencies were q and 1−q. For brevity, we omitted the subscript “l” of q, and we did not distinguish between frequencies and probabilities (the same notation was used for both), because the frequencies were the MLEs of the corresponding probabilities. All of the remaining calculations were similar, so we did not show them repeatedly.

Let inline image be a random variable, which denoted the allele at the lth locus of the ith haplotype, and let

  • math image

denote the kernel function of the lth locus (l = 1, 2,…, L; i, j = 1, 2,…, n; n = 2n1 for cases and n = 2n2 for controls). Define

  • display math

we could get inline image and inline image. So the expectation and variance were, respectively:

  • display math
  • display math

According to the standard results on U-statistics, we could obtain: inline image, that is, Vll = 4σll /n.

Next, we turned to the covariance Vst = Cov(Us,Ut). Assume that the joint distribution of loci s and t was inline image:

Locus s    Locus t    Probability
0          0          q1
0          1          q2
1          0          q3
1          1          q4

Hence we could get,

  • display math
  • display math

and let inline image.

Similarly, we could obtain μs = −2A1(q1 + q2)(q3 + q4) and μt = −2A2(q1 + q3)(q2 + q4). inline image. Therefore, σst = μst − μsμt = A1A2(q1q4 − q2q3)(1−2q1−2q2)(1−2q1−2q3). According to the standard results on U-statistics, we could obtain: inline image, that is, Vst = 4σst /n.
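
The closed form for σst can be checked by direct enumeration over the four joint haplotype states. The sketch below assumes that the per-locus kernel equals −A when the two alleles differ and 0 when they match; this kernel form is an assumption, chosen because it reproduces the stated means μs = −2A1(q1 + q2)(q3 + q4) and μt = −2A2(q1 + q3)(q2 + q4), and it is not taken from the kernel definition itself. All function names and numerical values are illustrative.

```python
def sigma_st_closed_form(a1, a2, q1, q2, q3, q4):
    """Closed form for sigma_st as given in the text."""
    return a1 * a2 * (q1 * q4 - q2 * q3) * (1 - 2 * q1 - 2 * q2) * (1 - 2 * q1 - 2 * q3)


def sigma_st_enumerated(a1, a2, q1, q2, q3, q4):
    """Compute (mu_s, mu_t, sigma_st) by enumerating the joint distribution."""
    # Joint distribution of the alleles at (locus s, locus t), as in the table.
    joint = {(0, 0): q1, (0, 1): q2, (1, 0): q3, (1, 1): q4}

    def kernel(a, x, y):
        # ASSUMED kernel: -a if the two alleles differ, 0 otherwise.
        return -a if x != y else 0.0

    def projection(a, x, locus):
        # Conditional expectation of the kernel given the first haplotype's allele x.
        return sum(pr * kernel(a, x, h[locus]) for h, pr in joint.items())

    mu_s = sum(pr * projection(a1, h[0], 0) for h, pr in joint.items())
    mu_t = sum(pr * projection(a2, h[1], 1) for h, pr in joint.items())
    mu_st = sum(pr * projection(a1, h[0], 0) * projection(a2, h[1], 1)
                for h, pr in joint.items())
    return mu_s, mu_t, mu_st - mu_s * mu_t


def cov_u(sigma, n):
    """Leading-order (co)variance of the U-statistics, V = 4 * sigma / n."""
    return 4.0 * sigma / n


# Illustrative values only.
params = dict(a1=1.0, a2=2.0, q1=0.5, q2=0.1, q3=0.1, q4=0.3)
mu_s, mu_t, sigma = sigma_st_enumerated(**params)
```

Under this assumed kernel, the enumerated value agrees with `sigma_st_closed_form(**params)`, and `cov_u(sigma, n)` then gives Vst for n haplotypes.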

Thus far, we obtained the variance of our test statistic under an independent model. However, the loci could follow any other reasonable dependence model, which we did not discuss in detail here. Results for the variance under the first-order Markov model are listed in the Supporting Information. Furthermore, it is worth noting that the structure information of the loci should be known before using the U-EGS method.

Supporting Information

Disclaimer: Supplementary materials have been peer-reviewed but not copyedited.

Filename: ahg12049-sup-0001-supmat.doc (size: 470K)

1 Variance of the U-EGS under 1st-order Markov Model

2 Other Simulation Results

Filename: ahg12049-sup-0002-supmat.doc (size: 308K)

Table S1 Power to detect disease-marker associations under different models and prevalences (Kp). The sample size is 100.

Table S2 Power to detect disease-marker associations under different models and prevalences (Kp). The sample size is 200.

Table S3 Power to detect disease-marker associations under different models and prevalences (Kp). The sample size is 1000.

Table S4 Power to detect disease-marker associations under different models and prevalences (Kp). The sample size is 2000.

Table S5 Power to detect disease-marker associations under different models and methods. The sample size is 400 and the prevalence Kp is 0.02.

Table S6 Power to detect disease-marker associations under different models and methods. The sample size is 100 and the prevalence Kp is 0.02.

Figure S1 Power for identifying disease-marker associations using U-AM and U-EGS, when sample size is 400, and prevalence Kp = 0.02.

Figure S2 Power to detect disease-marker associations using U-AM and U-EGS, when sample size is 400, and prevalence Kp = 0.002.

Figure S3 Power to identify disease-marker associations using U-AM and U-EGS, when sample size is 400, and prevalence Kp = 0.0002.

Please note: Wiley Blackwell is not responsible for the content or functionality of any supporting information supplied by the authors. Any queries (other than missing content) should be directed to the corresponding author for the article.