Keywords:

  • population stratification;
  • case-control study;
  • semi-parametric model;
  • smoothing method

Summary

  1. Top of page
  2. Summary
  3. Introduction
  4. Method
  5. Simulation Models
  6. Discussion
  7. Appendix
  8. Acknowledgments
  9. References

Recently, statistical methods have been proposed that use genomic markers to control for population stratification in genetic association studies. However, these methods either have unacceptably low power when population stratification becomes strong or cannot control for population stratification well under admixture population models. In this paper, we propose a semiparametric association test to detect genetic association between a candidate marker and a qualitative trait of interest in case-control designs. The performance of the test is compared to that of existing methods through simulations. The results show that our method gives the correct type I error rate under both discrete population models and admixture population models, and that it is robust to the extent of the population stratification. In most of the cases we considered, our method has higher power than existing methods and, in some cases, substantially higher power.


Introduction

Recently, several methods have been proposed that use a set of independent markers, referred to as genomic markers, to control for population stratification (Devlin & Roeder, 1999; Bacanu et al. 2000; Pritchard et al. 2000b; Devlin et al. 2001; Reich & Goldstein, 2001; Satten et al. 2001; Zhang & Zhao, 2001; Bacanu et al. 2002; Zhang et al. 2002; Zhu et al. 2002). Compared with family-based association designs, these methods may have greater power and make DNA samples easier to collect, and they may also be robust against potential population stratification. These methods can be broadly divided into two classes. The first class, called the GC (Genomic Control) method, consists of the methods proposed by Devlin & Roeder (1999), Bacanu et al. (2000), Devlin et al. (2001), and Reich & Goldstein (2001). They adjust the ordinary chi-square test statistic X2 to X2/λ, where λ can be estimated using genomic markers, and X2/λ is assumed to still have a χ2 distribution. If the sample comes from a population that consists of two subpopulations with different disease prevalences and different allele frequencies, then, as pointed out by Pritchard & Rosenberg (1999) and Zhang et al. (2003), E(X) ≠ 0 and, up to a constant, X2 will asymptotically follow a noncentral χ2 distribution χ2(δ). If the noncentrality parameter δ is small, the χ2 distribution adjusted by a suitable constant can be a good approximation of χ2(δ). However, if δ is large, using a χ2 adjusted by a constant as an approximation of a noncentral chi-square χ2(δ) will either lead to false positives or lose power, though we do not yet know how large δ may be in practice. The second class of methods, including those proposed by Pritchard et al. (2000, 2001), Satten et al. (2001), Zhang et al. (2002a) and Zhu et al. (2002) for qualitative traits and Zhang & Zhao (2001) for quantitative traits, essentially uses clustering methods to infer the population structure and incorporates this information into the association test.
Although simulation results have shown that these methods generally perform well under discrete subpopulation models, they may not be effective when the population under study is a mixture of ancestral populations. In this situation, inference of population structure (the number of ancestral populations and/or the probabilities that each individual belongs to every subpopulation) is very difficult, because virtually everybody in the sample is admixed, and there is little information about the ancestral populations (Pritchard et al. 2000b). An inaccurate estimate of the population structure will affect the validity of the test.

For a quantitative trait of interest, Zhang et al. (2003) proposed a Semi-Parametric Test of Association to avoid this drawback. The test is derived by first estimating the value of the genetic background variable for each sampled individual using the principal components of many independent marker genotypes, and then modeling the relation among trait values, genotypic scores of the candidate marker and the genetic background variable through a semi-parametric model. In this model, the trait value is the dependent variable, the genotypic scores enter as linear predictors, and the genetic background variable enters as a predictor through an unknown nonparametric (possibly nonlinear) function. The simulation results indicate that this method has the correct type-I error rate and is more powerful than a TDT-like test in all cases considered.

In this article, we develop an analogous Qualitative Semi-Parametric Test (QualSPT) for a qualitative trait of interest. The method also hinges on finding the background variable of each sampled individual via the principal component procedure. We model the relationship among disease status, genotypic scores and the genetic background variable through a semi-parametric logistic regression model. The model is semiparametric in the sense that the relation between the disease probability and the genetic background variable has an unknown functional form, that is, a nonparametric function, while the genotypic scores contribute linearly to the logit of the disease probability. Other variables are easy to include in the model. The parameters in the linear function and the nonparametric function can be estimated by the profile likelihood estimation method. To test whether the given set of independent markers controls for population stratification well, we apply the test to every independent marker to obtain a series of p-values, and then test whether the p-values are uniformly distributed using the Kolmogorov test statistic. Uniformly distributed p-values imply that the population stratification is well controlled.

We evaluate the performance of QualSPT through simulations with both discrete population models and admixture population models. The simulation results show that our procedure has the correct type-I error rate in the presence of population stratification when a sufficient number of independent markers is used, and that it is more powerful than statistical association tests for family-based association designs (Spielman et al. 1993) using the same number of individuals. Furthermore, QualSPT is more powerful than STRAT (Pritchard et al. 2000b) when STRAT overestimates the number of subpopulations (i.e., ancestral populations). In addition, QualSPT is much more powerful than the GC method (Devlin & Roeder, 1999; Bacanu et al. 2000) when population stratification is strong.

Method

Notation and Statistical Model

Let X denote a vector of the numerical codes for the candidate locus genotype, g (see section “genotype scoring” for detail). Let y denote the trait value. For the case-control design considered in this article, y= 1 denotes disease and y= 0 denotes normal. Let (y1, X1), … , (yn, Xn) denote the trait values and genotypic codes of n individuals. For a homogeneous population, genetic association between the qualitative trait and a candidate marker can be studied through the following logistic regression model:

  • $\log\dfrac{P(y_i = 1 \mid X_i)}{1 - P(y_i = 1 \mid X_i)} = \mu + X_i'\beta$ (1)

where β represents the effect of the genotype on the trait and μ is the intercept. We call μ the logistic phenotypic mean in the following discussion. Under this model, the log-likelihood function for given Xi is given by

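  • $l(\beta, \mu) = \sum_{i=1}^{n}\left[\, y_i(\mu + X_i'\beta) - \log\left(1 + e^{\mu + X_i'\beta}\right) \right]$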

The maximum likelihood estimate of β is asymptotically unbiased and the likelihood ratio test can be used to test the null hypothesis of no association, H0: β= 0. The advantage of this model is that it is straightforward to include environmental variables and to extend to multiple loci and gene-gene interactions.

In the presence of population stratification, however, model (1) may be invalid. Specifically, if the individuals come from different subpopulations with different μ in model (1) and different allele frequencies, then the maximum likelihood estimate β̂ will not be an unbiased estimator of β (see Appendix), and the test mentioned above will produce false positives due to population stratification. A similar problem occurs if the sampled individuals come from an admixed population. Consequently, model (1) is not valid for modeling the relationship between the trait and the marker genotype. In a non-homogeneous population, μ may vary across genetic backgrounds. Zhang et al. (2003) introduced a genetic background variable t (possibly multi-dimensional) that characterizes the differences among individuals with different genetic backgrounds, and estimated t using the principal components of the genotypes of many independent markers across the genome. In this article, we use the same method as Zhang et al. (2003) to estimate the genetic background variable t. Briefly, let xij denote a column vector of the numerical codes of the ith individual at the jth marker and xi = (x′i1, x′i2, …, x′iL)′ be the numerical code vector of the L-marker genotype of the ith individual, where x′ij denotes the transpose of vector xij. If the jth marker is a biallelic marker with alleles A1 and A2, we can code xij = 0, 1 and 2 for genotypes A1A1, A1A2 and A2A2, respectively. For a marker with more than two alleles, the details of coding xij can be found in Zhang et al. (2003). The principal component technique is used to reduce the dimension of the data. Let Σ denote the sample variance-covariance matrix of the sample x1, x2, …, xn and ej denote the eigenvector corresponding to the jth largest eigenvalue of Σ. We can then calculate the jth principal component of the ith individual as tij = x′i ej.
Let ti = (ti1, ti2, …, tiM) denote the first M principal components of the ith individual. We use ti as the estimated value of the genetic background variable of the ith individual. We will discuss how to choose M later.
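The principal component step can be illustrated with a minimal sketch, assuming biallelic markers coded 0, 1, 2 and an individuals-by-markers genotype array. The function name `genetic_background` is hypothetical, and centering the codes before projection is an assumption made here (it only shifts each score by a constant).

```python
import numpy as np

def genetic_background(genotypes, n_components=1):
    """Return the first n_components principal-component scores per individual.

    genotypes: (n, L) array of codes 0, 1, 2 at L biallelic markers.
    """
    x = np.asarray(genotypes, dtype=float)
    centered = x - x.mean(axis=0)            # assumption: center each marker
    cov = np.cov(centered, rowvar=False)     # L x L sample variance-covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]  # indices of the largest eigenvalues
    return centered @ eigvecs[:, order]      # t_ij = x_i' e_j for j = 1..M
```

Each row of the returned array plays the role of ti, with `n_components` corresponding to M.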

To take the genetic background effect into account, we assume the following semi-parametric model or logistic partial linear model:

  • $\log\dfrac{P(y_i = 1 \mid X_i, t_i)}{1 - P(y_i = 1 \mid X_i, t_i)} = \mu(t_i) + X_i'\beta$ (2)

where μ(t) is an unknown smooth function of the genetic background t and is not parametrized. In this article, a smooth function means a function with a continuous derivative. The assumption that the function μ(·) is smooth is based on the consideration that similar genetic backgrounds should lead to similar phenotypic means. For the sake of simplicity, we do not include any covariates in model (2), but this is not a necessary restriction.

The log-likelihood score of the ith individual for a given genotype code Xi and genetic background variable ti is given by

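  • $l_i(\beta, \mu(t_i)) = y_i\left(\mu(t_i) + X_i'\beta\right) - \log\left(1 + e^{\mu(t_i) + X_i'\beta}\right)$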

The log-likelihood function for the data of n individuals is given by

  • $l(\beta, \mu) = \sum_{i=1}^{n} l_i(\beta, \mu(t_i))$

Estimation and Test of Association

The estimation of the parameter β and the non-parametric function μ(t) under logistic partial linear models has been developed recently. Several methods have been proposed in the statistics literature; see, for example, Severini & Wong (1992), Severini & Staniswalis (1994), and Carroll et al. (1997). Here we follow the approach of Severini & Staniswalis (1994), which uses two different likelihood functions: the likelihood function l(β, μ) and a local likelihood function. The local likelihood function around point t is defined as

  • $\sum_{i=1}^{n} K\!\left(\dfrac{t - t_i}{h}\right) l_i(\beta, \eta)$ (3)

where t is an M-dimensional real vector; h is the smoothing parameter; and the kernel function K(·) is a real function on R^M that reaches its maximum value at the origin. Although different kernels have little effect on the estimation, the effect of the smoothing parameter h can be strong (Hart, 1997; Simonoff, 1996). In this paper, we use the quadratic kernel K(z) = ∏_{i=1}^{M} k(z_i) with

  • image

and where z = (z1, z2, …, zM)′. At this stage, we assume that the value of the smoothing parameter h is known. The method of choosing h will be discussed in a later section.

The estimation procedure has several steps. First, for t = tj, maximize the local log-likelihood function (3) with respect to η, assuming a fixed β. The maximizer η̂β(tj) is an estimator of μ(tj) for j = 1, 2, …, n. The values η̂β(t1), …, η̂β(tn) are then used to estimate β by maximizing the log-likelihood function with respect to β,

  • image(4)

Although there are no explicit solutions from (3) and (4), the estimation problem can be solved iteratively as follows:

Step 1. Solve the equation of η

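  • $\sum_{i=1}^{n} K\!\left(\dfrac{t - t_i}{h}\right)\left[\, y_i - \dfrac{e^{\eta + X_i'\beta_m}}{1 + e^{\eta + X_i'\beta_m}} \right] = 0$ (5)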

Let η̂m(t1), η̂m(t2), …, η̂m(tn) denote the solutions for η at t = t1, t = t2, …, t = tn, respectively. Here βm is the current estimated value of β.

Step 2. Solve the equation of β

  • image(6)

This yields the new parameter estimate βm+1.

We repeat this two-step process until convergence occurs.
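The two-step iteration can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the kernel is taken to be the product of Epanechnikov-type quadratic kernels 0.75(1 − z²) on |z| ≤ 1, each step is solved by a few Newton updates, Step 2 holds the fitted η values fixed, and the function names, the clipping of η to [−10, 10], and the stopping rule are all hypothetical choices.

```python
import numpy as np

def expit(u):
    """Inverse logit."""
    return 1.0 / (1.0 + np.exp(-u))

def product_quadratic_kernel(z):
    """Assumed kernel: prod_i 0.75 * (1 - z_i^2) on |z_i| <= 1, zero elsewhere."""
    k = np.where(np.abs(z) <= 1.0, 0.75 * (1.0 - z ** 2), 0.0)
    return k.prod(axis=-1)

def fit_partial_linear_logistic(y, X, t, h, n_outer=100, n_inner=5, tol=1e-5):
    """Alternate between the local fit of eta(t_j) and the global fit of beta."""
    y = np.asarray(y, dtype=float)
    n = y.size
    X = np.asarray(X, dtype=float).reshape(n, -1)
    t = np.asarray(t, dtype=float).reshape(n, -1)
    # W[j, i] = K((t_j - t_i) / h): weight of individual i in the local fit at t_j
    W = product_quadratic_kernel((t[:, None, :] - t[None, :, :]) / h)
    beta = np.zeros(X.shape[1])
    eta = np.zeros(n)
    for _ in range(n_outer):
        # Step 1: Newton updates of eta_j for the local score equation, beta fixed
        offset = X @ beta
        for _ in range(n_inner):
            prob = expit(eta[:, None] + offset[None, :])
            score = (W * (y[None, :] - prob)).sum(axis=1)
            info = (W * prob * (1.0 - prob)).sum(axis=1)
            eta = np.clip(eta + score / np.maximum(info, 1e-10), -10.0, 10.0)
        # Step 2: Newton updates of beta for the global score equation, eta fixed
        beta_new = beta.copy()
        for _ in range(n_inner):
            prob = expit(eta + X @ beta_new)
            score = X.T @ (y - prob)
            info = (X * (prob * (1.0 - prob))[:, None]).T @ X
            beta_new = beta_new + np.linalg.solve(info, score)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta, eta
```

The sketch stores an n × n weight matrix, which is convenient for samples of a few hundred individuals such as those simulated here; larger samples would call for a sparser representation.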

In the appendix, we show that β̂, the estimator of β, is asymptotically unbiased under model (2). Therefore, under the null hypothesis of no association, β̂ is asymptotically centered at 0. Our QualSPT test is constructed as a likelihood ratio test using the estimates calculated from the aforementioned procedure, and the test statistic has a χ2 distribution with degrees of freedom equal to the dimension of the numerical vector of the candidate locus genotype.

Choosing Smoothing Parameter and the Number of Principal Components

The test procedure mentioned above assumes a given smoothing parameter h. However, the effect of the smoothing parameter h can be strong. We follow the method proposed in Zhang et al. (2003) to choose h. Briefly, for any given value of h, let p1(h), …, pL(h) be the p-values obtained when we perform QualSPT for all the independent markers using that value of h. We denote the empirical distribution of the p-values based on the sample p1(h), …, pL(h) by Fn(x, h). The Kolmogorov test statistic is defined as M(h) = maxx|Fn(x, h) − F(x)|, with F being the uniform distribution function. The method is to choose the smoothing parameter h* that minimizes the Kolmogorov test statistic

  • $h^{*} = \arg\min_{h} M(h)$

The rationale of the method is that, for independent markers, the p-values should follow a uniform distribution if population stratification is well controlled for. The purpose of using independent markers is to control population stratification. Therefore, the smoothing parameter h is chosen to minimize the difference between the empirical distribution and the uniform distribution. This procedure also provides a statistical test to assess whether population stratification can be well controlled for by using the set of independent markers. If the p-value of the test statistic M(h*) is greater than a specified significance level, e.g. 0.05, we may consider that the population stratification has been well controlled for. That is, at the 0.05 significance level, the population stratification is considered reasonably controlled if M(h*) falls below the corresponding critical value of the Kolmogorov test (see Nguyan & Roger, 1989, p. 373). Pritchard et al. (2000b) also suggested this idea to estimate the number of subpopulations used in their test STRAT.
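The selection rule can be sketched as below: compute the Kolmogorov distance between the empirical distribution of the p-values and the uniform distribution, then keep the candidate h with the smallest distance. The names `kolmogorov_uniform`, `choose_bandwidth` and the callable `pvalues_for` (which would run the test on every independent marker for a given h) are hypothetical, and searching over a finite grid of candidates is an assumption.

```python
import numpy as np

def kolmogorov_uniform(pvalues):
    """M = max_x |F_n(x) - x| for a sample of p-values on [0, 1]."""
    p = np.sort(np.asarray(pvalues, dtype=float))
    n = p.size
    above = np.arange(1, n + 1) / n - p   # F_n just after each jump minus F
    below = p - np.arange(0, n) / n       # F minus F_n just before each jump
    return float(max(above.max(), below.max()))

def choose_bandwidth(candidates, pvalues_for):
    """Return (h_star, M(h_star)) over a finite grid of candidate h values."""
    stats = {h: kolmogorov_uniform(pvalues_for(h)) for h in candidates}
    h_star = min(stats, key=stats.get)
    return h_star, stats[h_star]
```

The returned statistic M(h*) is the quantity that would then be compared with the Kolmogorov critical value to judge whether stratification is adequately controlled.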

For a given data set, another question is how many principal components to use. Our suggestion is to start with only the first principal component. If the Kolmogorov test shows that the population stratification cannot be well controlled for, we use more principal components. Further discussion of how many principal components to use is given in the Discussion section.

Genotype Scoring

The candidate locus genotype can be scored in a number of ways. A simple scheme is to count the number of copies of each allele that an individual carries at the candidate locus. If there are m alleles, A1, A2, …, Am, at the candidate locus, this scheme creates a numerical vector X = (x1, …, xm−1) for the genotype, where xi is the number of copies of allele Ai in the genotype (i = 1, 2, …, m − 1). This scheme accounts only for the additive effect of the alleles and leads to a χ2 distribution with m − 1 degrees of freedom for the QualSPT statistic. Another scheme is as follows: let G1, G2, …, Gm(m+1)/2 denote all possible genotypes of the m alleles. Define X = (x1, …, xm(m+1)/2−1) as the numerical vector of genotype G, where

  • $x_i = 1$ if $G = G_i$ and $x_i = 0$ otherwise, for $i = 1, \ldots, m(m+1)/2 - 1$

This scheme will account for both the additive and dominant effect of the alleles. However, it makes the degrees of freedom much larger, i.e. m(m+ 1)/2 − 1.

For a biallelic locus, m = 2, we usually use the second scheme to score the genotype, and this is what we use in our simulations. If m is large, the first scheme may be more powerful. However, prior information about the mode of inheritance, when available, can guide the choice of scoring scheme. For example, for a recessive disease, we may score xi to be 1 for genotype AiAi and 0 for all other genotypes.
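The two scoring schemes can be sketched as follows, assuming alleles are indexed 0, …, m − 1 (rather than A1, …, Am) and a genotype is given as an unordered pair of allele indices. The function names and the ordering of genotypes produced by `combinations_with_replacement` are illustrative assumptions; the second function takes the scheme to be an indicator vector with the last genotype as the reference.

```python
import numpy as np
from itertools import combinations_with_replacement

def additive_scores(genotype, m):
    """Scheme 1: copies of each of the first m-1 alleles (m-1 scores)."""
    x = np.zeros(m - 1)
    for allele in genotype:
        if allele < m - 1:        # the last allele serves as the reference
            x[allele] += 1
    return x

def genotype_indicator_scores(genotype, m):
    """Scheme 2: indicator of each genotype except the last (m(m+1)/2 - 1 scores)."""
    genotypes = list(combinations_with_replacement(range(m), 2))
    x = np.zeros(len(genotypes) - 1)
    idx = genotypes.index(tuple(sorted(genotype)))
    if idx < x.size:              # the last genotype is coded as all zeros
        x[idx] = 1.0
    return x
```

For a biallelic locus the first scheme yields one score and the second yields two, matching the degrees of freedom stated above.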

Simulation Models

In this section, we discuss the simulation models used to assess whether QualSPT is robust to population stratification and to compare the power of QualSPT with other association tests. In order to compare the performance of QualSPT with that of the Genomic Control (GC) method developed by Devlin & Roeder (1999) and Bacanu et al. (2000), we only consider biallelic markers in our simulations, because the GC method is only applicable to biallelic markers. In our simulation studies, we generate the data through either discrete subpopulation models or continuous admixture population models. Other parameters varied in our simulations include the mode of inheritance and the disease prevalences among the subpopulations.

Discrete Subpopulation Models

We use empirical population genetics data from the population genetics database ALFRED (Osier et al. 2001; http://info.med.yale.edu/genetics/kkidd), which provides allele frequencies for both SNPs and microsatellite markers in different populations. For our simulation purposes, we extract 100 markers across four populations: Danes, San Francisco Chinese, Maya and Biaka. Because we focus on the use of SNP markers in our simulation, we pool the alleles of microsatellite markers to form biallelic markers with allele frequencies between 10% and 90%. We consider different numbers of markers, 100, 200, …, 500, by using the 100 markers multiple times to infer the genetic background variable.

Let fi denote the probability that an affected individual is sampled from the ith subpopulation, gi denote the probability that a normal individual is sampled from the ith subpopulation, and Pi be the prevalence of the disease in the ith subpopulation. For a rare disease, inline image (Pritchard & Rosenberg, 1999; Zhang et al. 2002). We sample 50, 15, 15 and 20 normal individuals from Danes, San Francisco Chinese, Maya and Biaka, respectively. We consider two cases of relative prevalences P1:P2:P3:P4= 1:2:3:4 and P1:P2:P3:P4= 1:4:6:8 which correspond to sample 24, 14, 23, 39 and 13, 16, 27, 44 diseased individuals from the four subpopulations, respectively.

In our assessment of whether QualSPT is robust to population stratification, we independently generate marker data 10 times for every marker among the 100 markers mentioned above, i.e. we generate 10 × 100 = 1000 markers that have no association with disease phenotype. For each case of relative prevalences, we perform statistical tests for each of the 1000 markers.

To compare the power of QualSPT with other statistical tests, we generate 1000 data sets. For each data set, the genotypes for the trait locus are resimulated using the marker allele frequencies of each of the 100 loci in turn. Let A and a denote the two alleles and f11, f12, and f22 denote the penetrances for genotypes AA, Aa, and aa, respectively (f12 = f11 or f22 corresponds to a dominant or recessive disease model). Let the relative risk RA = f11/f22. For a given RA value and mode of inheritance, the proportions of affected individuals with genotypes AA, Aa and aa can be easily calculated. Let $R_A^{(1)}$, $R_A^{(2)}$, $R_A^{(3)}$, and $R_A^{(4)}$ denote the relative risk in Danes, San Francisco Chinese, Maya, and Biaka, respectively. In our simulations, we vary the values of $R_A^{(i)}$ (i = 1, 2, 3, 4) and the disease models.

Admixture Populations

We assume that the population under study is an admixture of two ancestral populations. In our simulations, we use Danes and Biaka as the two ancestral populations to represent Europeans and Africans, extract the allele frequencies of 100 biallelic markers from ALFRED as mentioned above, and repeat these 100 markers if we need more marker data. We simulate data sets using a model similar to model C in Pritchard et al. (2000b). For each individual, we first simulate q, where q is the fraction of European ancestry and 1 − q the fraction of African ancestry. Then, at each locus, two alleles are drawn independently, with probability q from the Danes and probability 1 − q from the Biaka allele-frequency distributions.

Model A. We assume that q is uniformly distributed on the interval (0, 1). Normal individuals are sampled from this distribution. In our assessment of type-I errors, 100 normal individuals and 100 diseased individuals are sampled. In order to simulate diseased individuals, we assume that the prevalence of the disease is eight-fold higher in Danes than in Biaka. The rejection sampling described in Pritchard et al. (2000b) is used to simulate the diseased individuals.

To compare the power, let Rmath image denote the relative risk of individual i with genotype gi. We consider two same type alleles as different alleles if they come from different ancestral subpopulations. For example, we use A1 and A2 to denote the A allele which is transmitted from Danes and Biaka, respectively. So, the relative risk Rmath image depends on both the genetic background qi and the genotype at the candidate locus. We then simulate the genotypes of the diseased individuals using the probability

  • image

For the details, see Pritchard et al. (2000b). In our simulations, we let Rmath image=Rmath image=Rmath image= 1 and varied Rmath image and Rmath image and set Rmath image= (Rmath image+Rmath image)/2. The relative risk of other genotypes is calculated according to different disease models. For example, Rmath image= (Rmath image+Rmath image)/2 under the additive disease model.

Model B. Instead of using the uniform distribution to generate q as in model A, we generate q from the beta distribution B(2, 6) for normal individuals and from B(α, 2) for diseased individuals. This means that, on average, 1/4 of the genetic material is from Danes and 3/4 is from Biaka for normal individuals. For diseased individuals, on average, α/(α + 2) of the genetic material is from Danes and 2/(α + 2) is from Biaka. We use either α = 2 or α = 4 in our simulations.
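The marker-generation step of the admixture models can be sketched as below: each of an individual's two alleles at a locus is drawn from the first ancestral population with probability q and from the second otherwise. The function name is hypothetical, the allele-frequency arrays are placeholders rather than ALFRED values, and the sketch covers only null markers (it does not implement the rejection sampling or relative-risk weighting used for the candidate locus).

```python
import numpy as np

def simulate_admixed_genotypes(q, freq_a, freq_b, rng):
    """Genotypes (0, 1, 2 copies of the coded allele) for admixed individuals.

    q: (n,) fraction of ancestry from population a for each individual.
    freq_a, freq_b: (L,) coded-allele frequencies in the two ancestral populations.
    """
    q = np.asarray(q, dtype=float)[:, None]
    freq_a = np.asarray(freq_a, dtype=float)
    freq_b = np.asarray(freq_b, dtype=float)
    n, n_loci = q.shape[0], freq_a.size
    genotypes = np.zeros((n, n_loci), dtype=int)
    for _ in range(2):                                # two alleles per locus
        from_a = rng.random((n, n_loci)) < q          # ancestral origin of this allele
        p = np.where(from_a, freq_a, freq_b)          # frequency in the chosen population
        genotypes += rng.random((n, n_loci)) < p      # add 1 if the coded allele is drawn
    return genotypes
```

Under model B, the ancestry fractions would be drawn as `rng.beta(2, 6, size=100)` for normal individuals and `rng.beta(alpha, 2, size=100)` for diseased individuals, where `rng = np.random.default_rng()` and `alpha` is 2 or 4.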

For the above simulations, we assume that all the subpopulations have the same high-risk allele. We also conduct another set of simulations for power comparison under admixture model B with α = 2, where we randomly assign the high-risk allele independently in each of the ancestral populations according to the allele frequency. For example, consider a candidate locus with two alleles A and a. The frequencies of allele A in the two ancestral populations are p1 and p2, respectively. Then, for each of the 1000 replicated data sets, A is assigned as the high-risk allele with probabilities p1 and p2 in the two ancestral populations, respectively.

Other Association Tests Considered

In addition to QualSPT, we also consider several other association tests in our simulations. The first test is the χ2 test that ignores potential population stratification. The second test is the GC (Genomic Control) method developed by Devlin & Roeder (1999) and Bacanu et al. (2000), which uses the test statistic χ2/λ, where the parameter λ is estimated by λ̂ = median(χ²1, χ²2, …, χ²L)/0.456 and χ²i is the value of the χ2 test statistic for the ith independent marker. The third test is STRAT, proposed by Pritchard et al. (2000b). To perform STRAT, we first use Structure (Pritchard et al. 2000a) to estimate the probabilities that each individual belongs to each subpopulation, under the assumption that the number of subpopulations is known. Using either discrete subpopulation models or admixture population models, we also simulate a set of family triads and apply the TDT test proposed by Spielman et al. (1993) to determine whether there is an association between the marker and the trait. We denote this test by TDT. In power comparisons, we simulate 2n/3 and n triads, respectively, in the family-based association design, where n is the total number of diseased individuals in the sample of unrelated individuals. We cover a range of sample sizes in the power comparisons because the amount of phenotyping and genotyping differs between the two designs for the same number of individuals, which makes it difficult to select a single sample size that makes the comparison fair. For each simulation model, we first generate 2n/3 and n diseased individuals, respectively, in the total population as children, and then generate their parents' genotypes. The p-values of the TDT are also evaluated by simulation.
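The GC adjustment can be sketched in a few lines, assuming one-degree-of-freedom χ2 statistics for the candidate marker and for the independent markers. The function name is hypothetical, and the p-value uses the identity P(χ²₁ > x) = erfc(√(x/2)) so that only the standard library and NumPy are needed.

```python
import math
import numpy as np

def genomic_control(chi2_candidate, chi2_independent):
    """Return (adjusted statistic, lambda estimate, p-value) for the GC method."""
    lam = float(np.median(chi2_independent)) / 0.456   # 0.456 is roughly the median of chi2 with 1 df
    adjusted = chi2_candidate / lam
    p_value = math.erfc(math.sqrt(adjusted / 2.0))     # upper tail of chi2 with 1 df
    return adjusted, lam, p_value
```

For example, a candidate statistic of 7.682 with an estimated λ of 2 is adjusted to 3.841, which sits at the 5% critical value of the χ2 distribution with one degree of freedom.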

Results

Test whether population stratification is reasonably controlled for: The first step of QualSPT is to evaluate whether population stratification can be well controlled by a given set of genetic markers. We begin with the first principal component. If the Kolmogorov test indicates that population stratification cannot be well controlled for, we use the first two or three principal components (see how to choose the number of principal components in the Discussion). Figure 1 summarizes the test statistic M(h*) corresponding to different numbers of genetic markers under three population models. Under the discrete population model, when population stratification is strong (P1:P2:P3:P4 = 1:4:6:8), the Kolmogorov test shows that population stratification cannot be well controlled using only the first principal component but can be well controlled using the first two principal components (Figure 1(a) and (b)). For almost all the cases under admixture models A and B, the Kolmogorov test shows that population stratification can be well controlled using the first principal component (Figure 1(c) and (d)). This observation is consistent with the results given in Figures 2, 3 and 4, where we compare the type-I error rates of various statistical tests using different numbers of independent markers. These results suggest that the Kolmogorov test described above is useful both for determining whether a set of genomic markers can control for population stratification and for choosing the number of principal components for a given set of independent genetic markers. Under discrete population models, we include the type-I error results using the first principal component and the first two principal components in order to evaluate the performance of the Kolmogorov test.
The final results of QualSPT are obtained using the first principal component for the case of P1:P2:P3:P4 = 1:2:3:4 and the first two principal components for the case of P1:P2:P3:P4 = 1:4:6:8. In the following discussion, we only compare the final results of QualSPT with the results of other tests.

Figure 1. The number of independent markers versus the value of the test statistic M(h*) under the null hypothesis of no association. The dashed line is the critical value for the significance level of 5%. In figures (a) and (b), the solid line and the dotted line denote the results of the cases P1:P2:P3:P4 = 1:2:3:4 and P1:P2:P3:P4 = 1:4:6:8, respectively. In figure (d), the solid line and the dotted line denote the results of α = 2 and α = 4, respectively, under admixture population model B.

Figure 2. Type-I error comparisons of the four tests at the nominal value of 5% under discrete population models. The sample consists of 100 diseased individuals and 100 normal individuals. The two straight solid lines around 0.05 (type-I error) form the 95% confidence interval of the type-I error rate under the null hypothesis.

Figure 3. Type-I error comparisons of the four tests at the nominal value of 5% under admixture population model A. The sample size is 100 diseased individuals and 100 normal individuals. The two straight solid lines around 0.05 (type-I error) form the 95% confidence interval of the type-I error rate under the null hypothesis.

Figure 4. Type-I error comparisons of the four tests at the nominal value of 5% under admixture population model B. The sample size is 100 diseased individuals and 100 normal individuals. The two straight solid lines around 0.05 (type-I error) form the 95% confidence interval of the type-I error rate under the null hypothesis.

Type-I error rates: It is well known that TDT is robust to population stratification. For our type-I error evaluations, we therefore only consider the four tests based on unrelated samples: χ2, QualSPT, STRAT and GC. Figures 2, 3 and 4 summarize the type-I error rates of the four test statistics using different numbers of markers in simulations under the discrete population models, admixture population model A and admixture population model B, respectively. The results are based on 1000 replications (10 replications for each of 100 markers, as if there were 10 × 100 = 1000 replications), with each replication consisting of 100 diseased individuals and 100 normal individuals for all four tests. A total of 1000 simulated data sets are used for each sample in the estimation of the p-values. Therefore, for the statistical significance level of 0.05, the standard error of the type-I error rate estimate is √(0.05 × 0.95/1000) ≈ 0.007 and the 95% confidence interval of the type-I error is (0.036, 0.064). It is apparent from the figures that the χ2 test, which ignores potential population stratification, may have a type-I error rate that is substantially higher than the nominal level in the presence of population stratification (in all the cases considered here). Under the discrete population models (Figure 2), the type-I errors of both QualSPT (using the first principal component for the case of P1:P2:P3:P4 = 1:2:3:4 and the first two principal components for the case of P1:P2:P3:P4 = 1:4:6:8) and STRAT are within the 95% confidence interval of the nominal type-I error rate using from 100 to 500 independent markers. Although the type-I error rates of GC are much smaller than those of the χ2 test, there are several cases in which the type-I error rates of the GC test fall outside the 95% confidence interval, and these cases seem independent of the number of markers used to control for population stratification.
Population stratification under admixture model A (Figure 3) is not as strong as under the other models: the type-I error rates of the χ2 test range from 10% to 12% for the nominal level of 5%. Under this model, the type-I error rates of the QualSPT and GC tests (except when using 200 markers) are within the 95% confidence interval. The type-I error rate of STRAT is slightly higher than the upper boundary of the 95% confidence interval even when as many as 500 markers are used. The population stratification under admixture model B is much stronger than that under admixture model A. The type-I error rates of the χ2 test are around 24% and 40% for α = 2 and α = 4, respectively. Under this model, the type-I errors of QualSPT are within the 95% confidence interval of the nominal type-I error rate using 200 independent markers or more; except in a few cases, the type-I errors of the GC test are also within the 95% confidence interval of the nominal type-I error rate; and although the type-I error rate of STRAT decreases as the number of markers used to control for population stratification increases, it appears to stabilize at 7 ∼ 8% using 400 independent markers or more.
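The confidence band used to judge the type-I error rates follows from the binomial standard error with 1000 replications at the nominal level 0.05. The worked arithmetic below assumes the usual normal approximation with the multiplier 1.96:

```latex
\mathrm{SE} = \sqrt{\frac{0.05 \times 0.95}{1000}} \approx 0.0069,
\qquad
0.05 \pm 1.96 \times 0.0069 \approx (0.036,\ 0.064).
```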

In summary, the type-I error rates of QualSPT are within the 95% confidence interval of the nominal type-I error rate when 200 or more markers are used, for all the cases we considered. The type-I error rates of GC are around the nominal level, though there are several cases in which they fall outside the 95% confidence interval. Under discrete population models, the type-I error rates of STRAT are within the 95% confidence interval of the nominal type-I error rate when 100 or more independent markers are used. Under admixture models, the type-I error rate of STRAT decreases with the number of independent markers used to control for population stratification. However, when more than 300 independent markers are used, the type-I error rates of STRAT decrease very slowly and remain above the upper boundary of the 95% confidence interval even with 500 markers.

Power Comparisons: In this set of simulations, we compare the power of the four tests, QualSPT, STRAT, GC and TDT. The results are based on 1000 replications, with each replication consisting of n = 100 diseased individuals and 100 normal individuals for QualSPT, STRAT and GC, and of either 2n/3 or n triads for the TDT (2n/3 triads correspond to the same total number of individuals as the case-control sample, and n triads to the same number of diseased individuals). A total of 300 markers are used by QualSPT, STRAT and GC to control for population stratification. In almost all the cases we considered, across the different population models and disease models, QualSPT is more powerful than TDT when TDT uses 2n/3 triads, and less powerful than TDT when TDT uses n triads. The power comparisons among the three tests QualSPT, STRAT and GC are more complicated.

The results of our power comparisons under the discrete population models and the assumption of the same high risk allele in all the subpopulations are summarized in Table 1. For almost all the cases in this set of simulations, QualSPT is more powerful than STRAT. The power of GC is substantially lower than that of both QualSPT and STRAT, especially when population stratification becomes stronger (P1:P2:P3:P4 = 1:4:6:8).

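Table 1. Power comparisons of the four tests under discrete subpopulation models and the assumption of the same high risk allele in all the subpopulations. The sample size is n = 100 diseased and 100 normal individuals for QualSPT, STRAT and GC, and 2n/3 and n triads for TDT. Pi is the relative disease prevalence of the ith subpopulation and Ri is the relative risk of genotype AA in the ith subpopulation (i = 1, 2, 3, 4). P = 0.05 and P = 0.01 are the significance levels.

| P1:P2:P3:P4 | R1,R2,R3,R4 | Disease model | QualSPT, P = 0.05 | STRAT, P = 0.05 | GC, P = 0.05 | TDT (2n/3), P = 0.05 | TDT (n), P = 0.05 | QualSPT, P = 0.01 | STRAT, P = 0.01 | GC, P = 0.01 | TDT (2n/3), P = 0.01 | TDT (n), P = 0.01 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1:2:3:4 | 1,2,3,4 | Domi. | 0.42 | 0.33 | 0.21 | 0.34 | 0.48 | 0.24 | 0.15 | 0.06 | 0.15 | 0.27 |
| 1:2:3:4 | 1,2,3,4 | Add. | 0.28 | 0.30 | 0.28 | 0.32 | 0.46 | 0.10 | 0.11 | 0.10 | 0.13 | 0.23 |
| 1:2:3:4 | 1,2,3,4 | Rec. | 0.35 | 0.30 | 0.23 | 0.34 | 0.43 | 0.18 | 0.16 | 0.10 | 0.17 | 0.26 |
| 1:2:3:4 | 2,4,6,8 | Domi. | 0.86 | 0.76 | 0.58 | 0.73 | 0.84 | 0.75 | 0.60 | 0.30 | 0.52 | 0.72 |
| 1:2:3:4 | 2,4,6,8 | Add. | 0.79 | 0.65 | 0.59 | 0.74 | 0.87 | 0.59 | 0.45 | 0.31 | 0.52 | 0.73 |
| 1:2:3:4 | 2,4,6,8 | Rec. | 0.78 | 0.71 | 0.59 | 0.71 | 0.80 | 0.65 | 0.57 | 0.42 | 0.53 | 0.70 |
| 1:4:6:8 | 1,4,6,8 | Domi. | 0.88 | 0.64 | 0.52 | 0.70 | 0.83 | 0.77 | 0.46 | 0.27 | 0.50 | 0.70 |
| 1:4:6:8 | 1,4,6,8 | Add. | 0.77 | 0.70 | 0.47 | 0.72 | 0.87 | 0.61 | 0.53 | 0.24 | 0.52 | 0.75 |
| 1:4:6:8 | 1,4,6,8 | Rec. | 0.80 | 0.73 | 0.51 | 0.72 | 0.81 | 0.68 | 0.60 | 0.34 | 0.56 | 0.69 |
| 1:4:6:8 | 2,8,12,16 | Domi. | 0.97 | 0.80 | 0.78 | 0.84 | 0.90 | 0.94 | 0.65 | 0.61 | 0.71 | 0.84 |
| 1:4:6:8 | 2,8,12,16 | Add. | 0.95 | 0.90 | 0.65 | 0.90 | 0.96 | 0.89 | 0.79 | 0.36 | 0.80 | 0.91 |
| 1:4:6:8 | 2,8,12,16 | Rec. | 0.96 | 0.90 | 0.78 | 0.90 | 0.94 | 0.90 | 0.83 | 0.68 | 0.84 | 0.90 |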
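The statement that QualSPT is more powerful than STRAT in almost all the settings of Table 1 can be checked by a direct count. The sketch below only transcribes the QualSPT and STRAT columns of Table 1 in row order; the variable and function names are illustrative.

```python
# (QualSPT, STRAT) power pairs transcribed from Table 1, in row order.
pairs_p05 = [
    (0.42, 0.33), (0.28, 0.30), (0.35, 0.30),
    (0.86, 0.76), (0.79, 0.65), (0.78, 0.71),
    (0.88, 0.64), (0.77, 0.70), (0.80, 0.73),
    (0.97, 0.80), (0.95, 0.90), (0.96, 0.90),
]
pairs_p01 = [
    (0.24, 0.15), (0.10, 0.11), (0.18, 0.16),
    (0.75, 0.60), (0.59, 0.45), (0.65, 0.57),
    (0.77, 0.46), (0.61, 0.53), (0.68, 0.60),
    (0.94, 0.65), (0.89, 0.79), (0.90, 0.83),
]

def count_wins(pairs):
    """Number of settings in which the first test has strictly higher power."""
    return sum(1 for first, second in pairs if first > second)

wins = count_wins(pairs_p05) + count_wins(pairs_p01)
total = len(pairs_p05) + len(pairs_p01)
print(f"QualSPT exceeds STRAT in {wins} of {total} settings")  # 22 of 24
```

The two exceptions are the additive model with P1:P2:P3:P4 = 1:2:3:4 and relative risks 1,2,3,4, at both significance levels.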

Under admixture population model A and the assumption of the same high risk allele in the two ancestral populations, the results of the power comparisons are given in Table 2. Under this population model, the population stratification is not strong, and the powers of the three tests QualSPT, STRAT and GC are similar.

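Table 2. Power comparisons of the four tests under admixture population model A and the assumption of the same high risk allele in the two ancestral populations. The sample size is n = 100 diseased and 100 normal individuals for QualSPT, STRAT and GC, and 2n/3 and n triads for TDT. Ri is the relative risk of genotype AA in the ith ancestral population (i = 1, 2). P = 0.05 and P = 0.01 are the significance levels.

| R1,R2 | Disease model | QualSPT, P = 0.05 | STRAT, P = 0.05 | GC, P = 0.05 | TDT (2n/3), P = 0.05 | TDT (n), P = 0.05 | QualSPT, P = 0.01 | STRAT, P = 0.01 | GC, P = 0.01 | TDT (2n/3), P = 0.01 | TDT (n), P = 0.01 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2,2 | Domi. | 0.66 | 0.67 | 0.63 | 0.57 | 0.72 | 0.44 | 0.43 | 0.40 | 0.36 | 0.51 |
| 2,2 | Add. | 0.60 | 0.65 | 0.64 | 0.55 | 0.70 | 0.39 | 0.44 | 0.41 | 0.35 | 0.48 |
| 2,2 | Rec. | 0.58 | 0.60 | 0.61 | 0.52 | 0.63 | 0.34 | 0.39 | 0.37 | 0.33 | 0.44 |
| 4,4 | Domi. | 0.93 | 0.92 | 0.89 | 0.88 | 0.95 | 0.88 | 0.83 | 0.74 | 0.77 | 0.89 |
| 4,4 | Add. | 0.93 | 0.92 | 0.79 | 0.90 | 0.96 | 0.85 | 0.84 | 0.59 | 0.76 | 0.90 |
| 4,4 | Rec. | 0.93 | 0.92 | 0.80 | 0.89 | 0.94 | 0.83 | 0.82 | 0.64 | 0.76 | 0.89 |
| 2,4 | Domi. | 0.87 | 0.86 | 0.81 | 0.82 | 0.91 | 0.75 | 0.73 | 0.69 | 0.64 | 0.80 |
| 2,4 | Add. | 0.87 | 0.89 | 0.83 | 0.82 | 0.90 | 0.74 | 0.75 | 0.71 | 0.67 | 0.81 |
| 2,4 | Rec. | 0.84 | 0.84 | 0.74 | 0.80 | 0.88 | 0.71 | 0.72 | 0.59 | 0.63 | 0.79 |
| 4,8 | Domi. | 0.96 | 0.94 | 0.95 | 0.94 | 0.97 | 0.94 | 0.90 | 0.90 | 0.87 | 0.94 |
| 4,8 | Add. | 0.97 | 0.95 | 0.93 | 0.95 | 0.98 | 0.94 | 0.91 | 0.87 | 0.89 | 0.95 |
| 4,8 | Rec. | 0.96 | 0.95 | 0.94 | 0.94 | 0.98 | 0.92 | 0.92 | 0.91 | 0.88 | 0.95 |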

The results of the power comparisons under admixture population model B and the assumption of the same high risk allele in the two ancestral populations are summarized in Table 3. The results show that if the relative risk is high (R1 = 2 and R2 = 4), QualSPT is more powerful than STRAT. If the relative risk is low (R1 = 1 and R2 = 2), QualSPT and STRAT have similar power. The power of GC is much lower than that of both QualSPT and STRAT, especially under strong population stratification, i.e. α = 4. At the significance level P = 0.01, the power of QualSPT is roughly 2 to 5 times that of GC for α = 2, and 4 to 8 times that of GC for α = 4.

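Table 3. Power comparisons of the four tests under admixture population model B and the assumption of the same high risk allele in the two ancestral populations. The sample size is n = 100 diseased and 100 normal individuals for QualSPT, STRAT and GC, and 2n/3 and n triads for TDT. Ri is the relative risk of genotype AA in the ith ancestral population (i = 1, 2). α is the parameter in the Gamma distribution as described in the text. P = 0.05 and P = 0.01 are the significance levels.

| α | R1,R2 | Disease model | QualSPT, P = 0.05 | STRAT, P = 0.05 | GC, P = 0.05 | TDT (2n/3), P = 0.05 | TDT (n), P = 0.05 | QualSPT, P = 0.01 | STRAT, P = 0.01 | GC, P = 0.01 | TDT (2n/3), P = 0.01 | TDT (n), P = 0.01 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 1,2 | Domi. | 0.55 | 0.57 | 0.29 | 0.55 | 0.70 | 0.33 | 0.33 | 0.11 | 0.33 | 0.50 |
| 2 | 1,2 | Add. | 0.56 | 0.55 | 0.23 | 0.50 | 0.68 | 0.33 | 0.33 | 0.07 | 0.27 | 0.44 |
| 2 | 1,2 | Rec. | 0.56 | 0.58 | 0.34 | 0.49 | 0.64 | 0.33 | 0.37 | 0.13 | 0.29 | 0.45 |
| 2 | 2,4 | Domi. | 0.94 | 0.89 | 0.71 | 0.87 | 0.95 | 0.85 | 0.76 | 0.45 | 0.74 | 0.89 |
| 2 | 2,4 | Add. | 0.92 | 0.88 | 0.50 | 0.86 | 0.94 | 0.79 | 0.75 | 0.20 | 0.70 | 0.88 |
| 2 | 2,4 | Rec. | 0.89 | 0.87 | 0.62 | 0.85 | 0.92 | 0.76 | 0.74 | 0.31 | 0.70 | 0.85 |
| 4 | 1,2 | Domi. | 0.38 | 0.37 | 0.11 | 0.40 | 0.60 | 0.16 | 0.17 | 0.02 | 0.17 | 0.30 |
| 4 | 1,2 | Add. | 0.30 | 0.31 | 0.10 | 0.37 | 0.54 | 0.14 | 0.14 | 0.02 | 0.16 | 0.32 |
| 4 | 1,2 | Rec. | 0.28 | 0.30 | 0.10 | 0.32 | 0.48 | 0.14 | 0.15 | 0.03 | 0.17 | 0.30 |
| 4 | 2,4 | Domi. | 0.82 | 0.74 | 0.36 | 0.81 | 0.90 | 0.63 | 0.53 | 0.15 | 0.60 | 0.80 |
| 4 | 2,4 | Add. | 0.80 | 0.75 | 0.36 | 0.76 | 0.90 | 0.57 | 0.54 | 0.14 | 0.59 | 0.76 |
| 4 | 2,4 | Rec. | 0.74 | 0.70 | 0.27 | 0.75 | 0.90 | 0.48 | 0.45 | 0.12 | 0.53 | 0.68 |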
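The relative power of QualSPT and GC at the 0.01 level can be computed row by row from Table 3. The sketch below transcribes only the QualSPT and GC columns at P = 0.01; the data layout and function name are illustrative choices, not part of the original analysis.

```python
# (QualSPT, GC) power at significance level 0.01, transcribed from Table 3
# in row order and grouped by the Gamma parameter alpha.
power_p01 = {
    2: [(0.33, 0.11), (0.33, 0.07), (0.33, 0.13),
        (0.85, 0.45), (0.79, 0.20), (0.76, 0.31)],
    4: [(0.16, 0.02), (0.14, 0.02), (0.14, 0.03),
        (0.63, 0.15), (0.57, 0.14), (0.48, 0.12)],
}

def ratio_range(pairs):
    """Smallest and largest QualSPT/GC power ratio over the given rows."""
    ratios = [qualspt / gc for qualspt, gc in pairs]
    return min(ratios), max(ratios)

for alpha, pairs in power_p01.items():
    low, high = ratio_range(pairs)
    print(f"alpha = {alpha}: ratio from {low:.1f} to {high:.1f}")
```

Because the GC power is as low as 0.02 in some rows, the largest ratios are sensitive to simulation error in the denominator and should be read as rough indications only.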

The results of the power comparisons under the assumption of random high risk alleles are summarized in Table 4. In this set of simulations, STRAT is slightly more powerful than QualSPT, and both STRAT and QualSPT are much more powerful than GC.

Table 4. Power comparisons of the four tests under admixture population model B and the assumption of random high risk alleles in the two ancestral populations. The sample size is n = 100 diseased and 100 normal individuals for QualSPT, STRAT and GC, and 2n/3 and n triads for TDT. Ri is the relative risk of genotype AA in the ith ancestral population (i = 1, 2). Here, allele A denotes the high risk allele, which may be different in different ancestral populations. P = 0.05 and P = 0.01 are the significance levels.

| R1,R2 | Disease model | QualSPT, P = 0.05 | STRAT, P = 0.05 | GC, P = 0.05 | TDT (2n/3), P = 0.05 | TDT (n), P = 0.05 | QualSPT, P = 0.01 | STRAT, P = 0.01 | GC, P = 0.01 | TDT (2n/3), P = 0.01 | TDT (n), P = 0.01 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1,2 | Domi. | 0.55 | 0.71 | 0.19 | 0.31 | 0.41 | 0.33 | 0.58 | 0.05 | 0.14 | 0.25 |
| 1,2 | Add. | 0.56 | 0.51 | 0.10 | 0.30 | 0.41 | 0.33 | 0.33 | 0.04 | 0.13 | 0.21 |
| 1,2 | Rec. | 0.56 | 0.42 | 0.12 | 0.26 | 0.34 | 0.33 | 0.23 | 0.03 | 0.13 | 0.18 |
| 2,4 | Domi. | 0.94 | 0.74 | 0.30 | 0.40 | 0.60 | 0.85 | 0.63 | 0.14 | 0.23 | 0.41 |
| 2,4 | Add. | 0.92 | 0.66 | 0.23 | 0.50 | 0.63 | 0.79 | 0.48 | 0.10 | 0.31 | 0.43 |
| 2,4 | Rec. | 0.89 | 0.89 | 0.33 | 0.59 | 0.69 | 0.76 | 0.76 | 0.16 | 0.39 | 0.52 |

Discussion

Recently, several statistical methods have been proposed that use genomic markers to control for population stratification in the analysis of population-based data. These approaches are promising because they may have greater power than family-based association designs while remaining robust against potential population stratification. As described in the introduction, these methods can be broadly divided into two classes. The first class, the GC method, may lose power when population stratification is strong. The second class, the SA method, may have difficulty estimating the population structure under admixture, and an inaccurate estimate of the population structure will affect the validity of the test.

More recently, Zhu et al. (2002) proposed a mixture model using principal components to test association between a marker and a qualitative trait. Zhang et al. (2003) further developed a semi-parametric test of association, based on a partial linear model, to detect associations between candidate markers and quantitative traits using population-based data. They showed through simulations that the test has the correct type-I error rate under both discrete subpopulation models and admixture population models.

In this article, using an idea similar to that of Zhang et al. (2003), we have developed a semi-parametric test of association, QualSPT, based on a logistic partial linear model to detect association between a candidate marker and a qualitative trait in a case-control design. We have compared the performance of our test with that of the family-based TDT method, the GC method and STRAT, an SA method proposed by Pritchard et al. (2000b). Our simulation results show that QualSPT has the correct type-I error rate under both discrete subpopulation models and admixture population models. QualSPT is more powerful than the family-based TDT when the two tests use the same sample size (the same number of individuals). QualSPT is much more powerful than GC when population stratification is strong. Compared with STRAT, QualSPT has the correct type-I error rate when 200 or more independent markers are used to control for population stratification, whereas, under admixture population models and the assumption that the number of ancestral populations is known, STRAT may still show an excess of false-positive results even when 500 independent markers are used. In most cases, QualSPT is also more powerful than STRAT when all subpopulations have the same high risk allele. Moreover, it is straightforward to include covariates in QualSPT. If different subpopulations have different high risk alleles, STRAT is more powerful than QualSPT. However, as argued by Bourgain et al. (2000), there may be different high risk alleles for different individuals even within the same subpopulation. If this is the case, all the tests considered in this article will lose power, and methods based on multi-marker haplotypes may be more appropriate and powerful. Such haplotype-based methods need further investigation.

Although we have compared the power of QualSPT with that of TDT for two different sample sizes (the same total number of individuals and the same number of diseased individuals), the comparisons are based on the assumption that a set of independent markers is available to estimate the genetic background variables. If there is only one candidate locus, QualSPT may require substantially more genotyping effort. However, given the low prior probability that a specific gene is involved in a complex trait and the ever-decreasing genotyping cost, it may be more cost effective to perform a population-based study.

Another question is how many principal components should be used for a given data set. Generally speaking, more principal components carry more information. However, the first principal component summarizes most of the information in the data, followed by the second, the third, and the other principal components. After the first few principal components, additional components carry little information but make the model more complex and introduce more uncertainty (more parameters need to be estimated). Our suggestion is to start with the first principal component. If the Kolmogorov test indicates that it cannot control for the population stratification, the first two or three principal components should be used. In general, for most of the population genetic data we have analyzed, the first three principal components account for the majority of the genetic variation observed in the data.
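The principal components referred to above can be obtained from the genomic marker matrix by a singular value decomposition. The sketch below is an illustration only: the simulated two-subpopulation data, the function name principal_components and the use of the variance fraction as a guide are assumptions, and the Kolmogorov test used to decide whether stratification is controlled is not reproduced here.

```python
import numpy as np

def principal_components(genotypes, k):
    """Return the first k principal component scores of a marker matrix
    (rows = individuals, columns = markers) and the fraction of total
    variance carried by each of those k components."""
    centered = genotypes - genotypes.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :k] * s[:k]
    explained = s ** 2 / np.sum(s ** 2)
    return scores, explained[:k]

# Hypothetical data: 2 subpopulations, 100 individuals each, 300 markers,
# with allele frequencies drawn independently for each subpopulation.
rng = np.random.default_rng(0)
allele_freqs = rng.uniform(0.1, 0.9, size=(2, 300))
labels = np.repeat([0, 1], 100)
# Genotype codes 1, 0, -1 for AA, Aa, aa (number of A alleles minus one).
genotypes = rng.binomial(2, allele_freqs[labels]).astype(float) - 1.0

scores, explained = principal_components(genotypes, 3)
print("variance fraction of first three components:", np.round(explained, 3))
print("mean PC1 score by subpopulation:",
      scores[labels == 0, 0].mean(), scores[labels == 1, 0].mean())
```

In this simulated setting the first component separates the two subpopulations, which is the situation in which a single principal component would be expected to suffice.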

Appendix

A.1. The expectation of β under Model 1

For simplicity, we consider the case in which the population under study consists of two subpopulations and the candidate locus is biallelic with alleles A and a. The numerical codes x = 1, 0 and −1 represent genotypes AA, Aa and aa, respectively. This coding scheme is equivalent to the first scheme described in the section “Genotype Scoring”. Let yij and xij denote the disease status and the numerical code of the genotype of the jth individual in the ith subpopulation, respectively. If different subpopulations have different logistic phenotype means, the model is given by

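  • $\log\dfrac{P(y_{ij}=1 \mid x_{ij})}{1-P(y_{ij}=1 \mid x_{ij})} = \mu_i + \beta x_{ij}$ (A1)

for j = 1, …, ni, i = 1, 2. Suppose we ignore the population structure by assuming the following model

  • $\log\dfrac{P(y_{ij}=1 \mid x_{ij})}{1-P(y_{ij}=1 \mid x_{ij})} = \mu + \beta x_{ij}$ (A2)

for j = 1, …, ni, i = 1, 2. If we estimate the parameters under Model (A2), the maximum likelihood estimators of μ and β are obtained by maximizing the log-likelihood function

  • $l(\mu,\beta) = \sum_{i=1}^{2}\sum_{j=1}^{n_i}\left[\, y_{ij}(\mu+\beta x_{ij}) - \log\left(1+e^{\mu+\beta x_{ij}}\right)\right]$

Therefore, $\hat{\mu}$ and $\hat{\beta}$ satisfy

  • $\sum_{i=1}^{2}\sum_{j=1}^{n_i}\left[\, y_{ij} - \dfrac{e^{\hat{\mu}+\hat{\beta} x_{ij}}}{1+e^{\hat{\mu}+\hat{\beta} x_{ij}}}\right] = 0, \qquad \sum_{i=1}^{2}\sum_{j=1}^{n_i} x_{ij}\left[\, y_{ij} - \dfrac{e^{\hat{\mu}+\hat{\beta} x_{ij}}}{1+e^{\hat{\mu}+\hat{\beta} x_{ij}}}\right] = 0$

We can prove, through some calculation, that if $\hat{\beta} \to \beta$ as n = n1 + n2 → ∞, then either μ1 = μ2 or p1 = p2, where p1 and p2 are the frequencies of allele A in the first and second subpopulations, respectively. Thus, we have the following:

Proposition 1. Under a structured population, i.e. Model (A1), if μ1 ≠ μ2 and p1 − p2 ≠ 0, then $\hat{\beta} \not\to \beta$ as n = n1 + n2 → ∞.

The proposition implies that $\hat{\beta}$ does not converge to β even when the sample size becomes large, unless the two subpopulations have the same logistic phenotype mean μ or the same allele frequency.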
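Proposition 1 can be illustrated numerically. The following sketch assumes a logit link for Models (A1) and (A2), simulates two subpopulations with no true genotype effect (β = 0), and fits the pooled model that ignores population structure by Newton-Raphson. The parameter values, sample sizes and function names are hypothetical choices for illustration, and individuals are sampled prospectively rather than through a case-control design.

```python
import numpy as np

def fit_logistic(x, y, n_iter=25):
    """Newton-Raphson maximum likelihood fit of logit P(y = 1) = mu + beta * x.
    Returns the array (mu_hat, beta_hat)."""
    design = np.column_stack([np.ones_like(x), x])
    theta = np.zeros(2)
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-design @ theta))
        score = design.T @ (y - prob)
        weights = prob * (1.0 - prob)
        information = design.T @ (design * weights[:, None])
        theta = theta + np.linalg.solve(information, score)
    return theta

def simulate(n_per_pop, mu, p, beta, rng):
    """Simulate genotype codes and disease status for each subpopulation,
    with logistic mean mu[i] and frequency p[i] of allele A."""
    xs, ys = [], []
    for mu_i, p_i in zip(mu, p):
        x = rng.binomial(2, p_i, size=n_per_pop) - 1.0  # AA = 1, Aa = 0, aa = -1
        prob = 1.0 / (1.0 + np.exp(-(mu_i + beta * x)))
        ys.append(rng.binomial(1, prob).astype(float))
        xs.append(x)
    return np.concatenate(xs), np.concatenate(ys)

rng = np.random.default_rng(1)

# Different means and different allele frequencies: pooled estimate is biased.
x, y = simulate(50000, mu=(-2.0, 0.0), p=(0.2, 0.8), beta=0.0, rng=rng)
_, beta_pooled = fit_logistic(x, y)

# Same mean in both subpopulations: pooled estimate stays near the true 0.
x_same, y_same = simulate(50000, mu=(-1.0, -1.0), p=(0.2, 0.8), beta=0.0, rng=rng)
_, beta_same = fit_logistic(x_same, y_same)

print("pooled beta, mu1 != mu2 and p1 != p2:", round(beta_pooled, 2))
print("pooled beta, mu1 == mu2:", round(beta_same, 2))
```

With these settings the first estimate stays well away from zero however large the sample, while the second is close to zero, in line with the two exceptions stated in the proposition.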

A.2. Asymptotic unbiasedness of $\hat{\beta}$

The estimate $\hat{\beta}$ for model (2) is asymptotically unbiased if the bandwidth h in the kernel is of order $O(n^{-\alpha})$ with 0 < α < 1/4. A sketch of the proof is as follows.

Using the notation in the text, denote the log-likelihood function by

  • image

Based on the local log-likelihood (3), if μ(t) is a smooth function, i.e., the derivative μ′(t) is continuous, and the bandwidth h satisfies $h = O(n^{-\alpha})$ with 0 < α < 1/4, then there exists a function $\mu_\beta(t)$ such that $\hat{\mu}_\beta(t) \to \mu_\beta(t)$ in probability (Severini & Wong, 1992). Through some standard arguments from profile likelihood estimation, we can show that $\sqrt{n}(\hat{\beta} - \beta)$ has a limiting normal distribution $N(0, [i(\beta)]^{-1})$, where

  • image

and that $\hat{\beta}$ is asymptotically unbiased.

Acknowledgments

We thank Dr. Kenneth K. Kidd for access to the ALFRED population genetics database. This work was supported in part by grant GM59507 from the National Institutes of Health and by grant No. 10071011 from the NNSF of China.

References
