In this section, we describe the simulation models used to assess whether QualSPT is robust to population stratification and to compare the power of QualSPT with that of other association tests. To compare the performance of QualSPT with that of the Genomic Control (GC) method developed by Devlin & Roeder (1999) and Bacanu *et al.* (2000), we consider only biallelic markers in our simulations, because the GC method is applicable only to biallelic markers. We generate the data through either discrete subpopulation models or continuous admixture population models. Other parameters varied in our simulations include the mode of inheritance and the disease prevalences among the subpopulations.

#### Discrete Subpopulation Models

We use empirical population genetics data from the ALFRED database (Osier *et al.* 2001; http://info.med.yale.edu/genetics/kkidd), which provides allele frequencies for both SNPs and microsatellite markers in different populations. For our simulations, we extract 100 markers across four populations: Danes, San Francisco Chinese, Maya and Biaka. Because we focus on SNP markers in our simulations, we pool the alleles of each microsatellite marker to form a biallelic marker with allele frequencies between 10% and 90%. We consider different numbers of markers (100, 200, … , 500), using the 100 markers multiple times, to infer the genetic background variable.

Let *f*_{i} denote the probability that an affected individual is sampled from the *i*th subpopulation, *g*_{i} denote the probability that a normal individual is sampled from the *i*th subpopulation, and *P*_{i} be the prevalence of the disease in the *i*th subpopulation. For a rare disease, *f*_{i} = *g*_{i}*P*_{i} / Σ_{j} *g*_{j}*P*_{j} (Pritchard & Rosenberg, 1999; Zhang *et al.* 2002). We sample 50, 15, 15 and 20 normal individuals from Danes, San Francisco Chinese, Maya and Biaka, respectively. We consider two cases of relative prevalences, *P*_{1}:*P*_{2}:*P*_{3}:*P*_{4}= 1:2:3:4 and *P*_{1}:*P*_{2}:*P*_{3}:*P*_{4}= 1:4:6:8, which correspond to sampling 24, 14, 23, 39 and 13, 16, 27, 44 diseased individuals from the four subpopulations, respectively.
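
As an illustration of how the case-sampling proportions depend on the relative prevalences, the following minimal sketch assumes the usual rare-disease approximation in which *f*_{i} is proportional to *g*_{i}*P*_{i}. The function name `case_sampling_probs` is hypothetical and is not part of the original study.

```python
def case_sampling_probs(control_probs, prevalences):
    """Return f_i, the probability that a case comes from subpopulation i.

    Assumes the rare-disease approximation f_i = g_i * P_i / sum_j(g_j * P_j),
    where g_i is the control sampling probability and P_i the (relative)
    prevalence. Only the ratios of the prevalences matter.
    """
    weights = [g * p for g, p in zip(control_probs, prevalences)]
    total = sum(weights)
    return [w / total for w in weights]


# Control proportions for Danes, San Francisco Chinese, Maya and Biaka.
g = [0.50, 0.15, 0.15, 0.20]

f_mild = case_sampling_probs(g, [1, 2, 3, 4])
f_strong = case_sampling_probs(g, [1, 4, 6, 8])

# Expected number of cases out of 100 in each subpopulation.
print([round(100 * f, 1) for f in f_mild])
print([round(100 * f, 1) for f in f_strong])
```

Under this approximation the expected counts are close to, though not identical with, the counts quoted above; the text does not state how the exact integer counts were obtained.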

To assess whether QualSPT is robust to population stratification, we independently generate marker data 10 times for each of the 100 markers mentioned above, i.e. we generate 10 × 100 = 1000 markers that have no association with the disease phenotype. For each case of relative prevalences, we perform the statistical tests on each of the 1000 markers.

To compare the power of QualSPT with that of other statistical tests, we generate 1000 data sets. For each data set, the genotypes for the trait locus are resimulated using the marker allele frequencies of each of the 100 loci in turn. Let *A* and *a* denote the two alleles and *f*_{11}, *f*_{12}, and *f*_{22} denote the penetrances for genotypes *AA*, *Aa*, and *aa*, respectively (*f*_{12}=*f*_{11} or *f*_{12}=*f*_{22} corresponds to a dominant or recessive disease model). Let the relative risk *R*_{A}=*f*_{11}/*f*_{22}. For a given *R*_{A} value and mode of inheritance, the proportions of affected individuals with genotypes *AA*, *Aa* and *aa* can be easily calculated. Let *R*_{A_1}, *R*_{A_2}, *R*_{A_3}, and *R*_{A_4} denote the relative risk in Danes, San Francisco Chinese, Maya, and Biaka, respectively. In our simulations, we vary the values of *R*_{A_i} (*i*= 1, 2, 3, 4) and the disease model.
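
The calculation of genotype proportions among affected individuals can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: genotype frequencies within a subpopulation are assumed to follow Hardy-Weinberg proportions, the baseline penetrance *f*_{22} is set to 1 because only ratios matter, and the additive model is assumed to mean *f*_{12} = (*f*_{11} + *f*_{22})/2, which the text does not define. The function name is hypothetical.

```python
def affected_genotype_probs(p, r_a, model):
    """Return P(AA | affected), P(Aa | affected), P(aa | affected).

    p     : frequency of allele A in the subpopulation
    r_a   : relative risk R_A = f11 / f22
    model : "dominant", "recessive" or "additive"
    """
    f22 = 1.0                 # baseline penetrance; only ratios matter
    f11 = r_a * f22
    if model == "dominant":
        f12 = f11
    elif model == "recessive":
        f12 = f22
    elif model == "additive":
        f12 = (f11 + f22) / 2.0   # assumption: heterozygote risk is the average
    else:
        raise ValueError("unknown disease model: " + model)

    # Assumption: Hardy-Weinberg genotype frequencies in the subpopulation.
    freqs = (p * p, 2.0 * p * (1.0 - p), (1.0 - p) ** 2)
    weights = [f * g for f, g in zip((f11, f12, f22), freqs)]
    total = sum(weights)
    return [w / total for w in weights]


dominant = affected_genotype_probs(0.5, 2.0, "dominant")
recessive = affected_genotype_probs(0.5, 2.0, "recessive")
print(dominant, recessive)
```

With *R*_{A} = 1 the function returns the population genotype frequencies, which is the null case used for the type-I error simulations.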

#### Admixture Populations

We assume that the population under study is an admixture of two ancestral populations. In our simulations, we use Danes and Biaka as the two ancestral populations to represent Europeans and Africans. We extract the allele frequencies of the 100 biallelic markers mentioned above from ALFRED and repeat these 100 markers if more marker data are needed. We simulate data sets using a model similar to model C in Pritchard *et al.* (2000b). For each individual, we first simulate *q*, where *q* is the fraction of European ancestry and 1 −*q* the fraction of African ancestry. Then, at each locus, two alleles are drawn independently, each with probability *q* from the Danes and probability 1 −*q* from the Biaka allele-frequency distribution.
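
The per-locus allele draw described above can be sketched as below. This is a minimal illustration, not the original simulation code; the function name, the argument names and the coding of a genotype as the count of allele *A* are assumptions made for the example.

```python
import random


def simulate_genotypes(q, freq_danes, freq_biaka, rng):
    """Simulate one admixed individual's genotypes.

    q          : fraction of European (Danes) ancestry for this individual
    freq_danes : frequency of allele A at each locus in Danes
    freq_biaka : frequency of allele A at each locus in Biaka
    rng        : a random.Random instance

    Returns the number of copies of allele A (0, 1 or 2) at each locus.
    """
    genotypes = []
    for p_danes, p_biaka in zip(freq_danes, freq_biaka):
        count = 0
        for _ in range(2):  # two alleles, drawn independently
            # Choose the ancestral population of this allele.
            p = p_danes if rng.random() < q else p_biaka
            # Draw the allele from that population's frequency.
            if rng.random() < p:
                count += 1
        genotypes.append(count)
    return genotypes


rng = random.Random(2024)
# Illustrative allele frequencies only; not taken from ALFRED.
example = simulate_genotypes(0.3, [0.2, 0.7, 0.5], [0.6, 0.1, 0.5], rng)
print(example)
```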

*Model A.* We assume that *q* is uniformly distributed in the interval (0, 1). Normal individuals are sampled from this distribution. In our assessment of type-I errors, 100 normal individuals and 100 diseased individuals are sampled. To simulate diseased individuals, we assume that the prevalence of the disease is eight-fold higher in Danes than in Biaka. Rejection sampling, as described in Pritchard *et al.* (2000b), is used to simulate the diseased individuals.

*Model B.* Instead of using the uniform distribution to generate *q* as in model A, we generate *q* from the beta distribution *B*(2, 6) for normal individuals and from *B*(α, 2) for diseased individuals. This means that, on average, 1/4 of the genetic material is from Danes and 3/4 is from Biaka for normal individuals. For diseased individuals, on average, α/(α+ 2) of the genetic material is from Danes and 2/(α+ 2) is from Biaka. We use either α= 2 or α= 4 in our simulations.
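
The stated averages follow from the mean of a beta distribution, E[*q*] = *a*/(*a* + *b*) for *B*(*a*, *b*). The short sketch below draws ancestry fractions with the standard library and compares the empirical means with these values; the function name, sample size and seeds are arbitrary choices for illustration.

```python
import random


def draw_ancestry(a, b, n, seed):
    """Draw n ancestry fractions q from the beta distribution B(a, b)."""
    rng = random.Random(seed)
    return [rng.betavariate(a, b) for _ in range(n)]


controls = draw_ancestry(2, 6, 20000, seed=0)   # normal individuals
cases = draw_ancestry(4, 2, 20000, seed=1)      # diseased individuals, alpha = 4

mean_controls = sum(controls) / len(controls)   # theoretical mean 2/(2+6) = 1/4
mean_cases = sum(cases) / len(cases)            # theoretical mean 4/(4+2) = 2/3
print(mean_controls, mean_cases)
```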

For the above simulations, we assume that all the subpopulations have the same high risk allele. We also conduct another set of simulations for power comparison under admixture model B with α= 2, where we randomly assign the high risk allele independently in each of the ancestral populations according to the allele frequency. For example, consider a candidate locus with two alleles *A* and *a*. The frequencies of allele *A* in the two ancestral populations are *p*_{1} and *p*_{2}, respectively. Then, for each of the 1000 replicated data sets, *A* is assigned as the high risk allele with probabilities *p*_{1} and *p*_{2} in the two ancestral populations, respectively.

#### Other Association Tests Considered

In addition to QualSPT, we consider several other association tests in our simulations. The first test is the χ^{2} test, which ignores potential population stratification. The second test is the GC (Genomic Control) method developed by Devlin & Roeder (1999) and Bacanu *et al.* (2000), which uses the test statistic χ^{2}/λ, where the parameter λ is estimated by median(χ^{2}_{1}, χ^{2}_{2}, … , χ^{2}_{L})/0.456 and χ^{2}_{i} is the value of the χ^{2} test statistic for the *i*th independent marker. The third test is STRAT, proposed by Pritchard *et al.* (2000b). To perform the STRAT test, we first use *Structure* (Pritchard *et al.* 2000a) to estimate the probabilities that each individual belongs to each subpopulation, under the assumption that the number of subpopulations is known. Using either discrete subpopulation models or admixture population models, we also simulate a set of family triads and apply the TDT test proposed by Spielman *et al.* (1993) to determine whether there is an association between the marker and the trait. We denote this test by TDT. In power comparisons, we simulate 2*n*/3 and *n* triads, respectively, in the family-based association design, where *n* is the total number of diseased individuals in the sample of unrelated individuals. We cover a range of sample sizes in the power comparisons because the amount of phenotyping and genotyping differs between the two designs for the same number of individuals, which makes it difficult to select a single fixed sample size for a fair comparison. For each simulation model, we first generate 2*n*/3 and *n* diseased individuals, respectively, in the total population as children, and then generate their parents' genotypes. The p-values of the TDT are also evaluated by simulation.
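
The GC adjustment described above can be sketched in a few lines. This is an illustrative implementation of the formula as written in the text, not the authors' code; the function name is hypothetical, and the p-value is computed from a χ^{2} distribution with one degree of freedom, which is an assumption about how the adjusted statistic is referred.

```python
import math
from statistics import median


def genomic_control(chi2_null_markers, chi2_candidate):
    """Adjust a candidate chi-square statistic by the GC inflation factor.

    chi2_null_markers : chi-square statistics of the L independent markers
    chi2_candidate    : chi-square statistic of the marker being tested

    Returns (lambda estimate, adjusted statistic, p-value).
    """
    # lambda = median(chi2_1, ..., chi2_L) / 0.456, as given in the text.
    lam = median(chi2_null_markers) / 0.456
    adjusted = chi2_candidate / lam
    # Assumption: 1 degree of freedom, for which
    # P(chi2_1 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(adjusted / 2.0))
    return lam, adjusted, p_value


lam, adjusted, p_value = genomic_control([0.456, 0.912, 1.368], 7.68)
print(lam, adjusted, p_value)
```

In this toy input the median of the null statistics is twice 0.456, so the candidate statistic is halved before its p-value is computed.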

#### Results

*Test whether population stratification is reasonably controlled for:* The first step of QualSPT is to evaluate whether population stratification can be well controlled by a given set of genetic markers. We begin with the first principal component. If the Kolmogorov test indicates that population stratification cannot be well controlled for, we use the first two or three principal components (see Discussion for how to choose the number of principal components). Figure 1 summarizes the test statistic for different numbers of genetic markers under three population models. Under the discrete population model with strong population stratification (*P*_{1}:*P*_{2}:*P*_{3}:*P*_{4}= 1:4:6:8), the Kolmogorov test shows that population stratification cannot be well controlled using only the first principal component but can be well controlled using the first two principal components (Figure 1(a) and (b)). For almost all of the cases under admixture models A and B, the Kolmogorov test shows that population stratification can be well controlled using the first principal component (Figure 1(c) and (d)). This observation is consistent with the results given in Figures 2, 3 and 4, where we compare the type-I error rates of various statistical tests using different numbers of independent markers. These results suggest that the Kolmogorov test described above is useful both for determining whether a set of genomic markers can control for population stratification and for choosing the number of principal components for a given set of independent genetic markers. Under discrete population models, we include the type-I error results using the first principal component and the first two principal components in order to evaluate the performance of the Kolmogorov test.

The final results of QualSPT are obtained using the first principal component for the case of *P*_{1}:*P*_{2}:*P*_{3}:*P*_{4}= 1:2:3:4 and the first two principal components for the case of *P*_{1}:*P*_{2}:*P*_{3}:*P*_{4}= 1:4:6:8. In the following discussion, we compare only the final results of QualSPT with the results of the other tests.

*Type-I error rates:* It is well known that TDT is robust to population stratification. For our type-I error evaluations, we therefore consider only the four tests based on unrelated samples: χ^{2}, QualSPT, STRAT and GC. Figures 2, 3 and 4 summarize the type-I error rates of the four test statistics for different numbers of markers under the discrete population models, admixture population model A and admixture population model B, respectively. The results are based on 1000 replications (10 replications for each of 100 markers, treated as 10 × 100 = 1000 replications), with each replication consisting of 100 diseased individuals and 100 normal individuals for all four tests. A total of 1000 simulated data sets are used for each sample in the estimation of the p-values. Therefore, for the statistical significance level of 0.05, the standard error of the type-I error rate estimate is √(0.05 × 0.95/1000) ≈ 0.007 and the 95% confidence interval of the type-I error is (0.036, 0.064). It is apparent from the figures that the χ^{2} test, which ignores potential population stratification, has a type-I error rate substantially higher than the nominal level in the presence of population stratification in all the cases considered here. Under the discrete population models (Figure 2), the type-I errors of both QualSPT (using the first principal component for the case of *P*_{1}:*P*_{2}:*P*_{3}:*P*_{4}= 1:2:3:4 and the first two principal components for the case of *P*_{1}:*P*_{2}:*P*_{3}:*P*_{4}= 1:4:6:8) and STRAT are within the 95% confidence interval of the nominal type-I error rate for 100 to 500 independent markers. Although the type-I error rates of GC are much smaller than those of the χ^{2} test, there are several cases in which the type-I error rates of the GC test fall outside the 95% confidence interval, and these cases seem independent of the number of markers used to control for population stratification.

Population stratification under admixture model A (Figure 3) is not as strong as under the other models: the type-I error rates of the χ^{2} test range from 10% to 12% at the nominal level of 5%. Under this model, the type-I error rates of the QualSPT and GC tests (except when using 200 markers) are within the 95% confidence interval. The type-I error rate of STRAT is slightly higher than the upper boundary of the 95% confidence interval even when as many as 500 markers are used. Population stratification under admixture model B is much stronger than that under admixture model A. The type-I error rates of the χ^{2} test are around 24% and 40% for α= 2 and α= 4, respectively. Under this model, the type-I errors of QualSPT are within the 95% confidence interval of the nominal type-I error rate when 200 or more independent markers are used. Except in a few cases, the type-I errors of the GC test are also within the 95% confidence interval of the nominal type-I error rate. Although the type-I error rate of STRAT decreases as the number of markers used to control for population stratification increases, it seems to stabilize at 7 ∼ 8% when 400 or more independent markers are used.
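
The confidence interval used above to judge the type-I error rates follows from the binomial standard error of a proportion estimated from 1000 replications. A small sketch of the arithmetic is given below, assuming the usual normal approximation with a critical value of 1.96; the function name is hypothetical.

```python
import math


def type1_error_interval(alpha, n_replications, z=1.96):
    """Standard error and normal-approximation interval for an
    estimated type-I error rate whose true value is alpha."""
    se = math.sqrt(alpha * (1.0 - alpha) / n_replications)
    return se, (alpha - z * se, alpha + z * se)


se, (lower, upper) = type1_error_interval(0.05, 1000)
print(round(se, 4), round(lower, 3), round(upper, 3))
```

An estimated type-I error rate outside this interval is therefore unlikely to be explained by simulation noise alone.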

In summary, the type-I errors of QualSPT are within the 95% confidence interval of the nominal type-I error rate when 200 or more markers are used, for all the cases we considered. The type-I errors of GC are around the nominal level, though there are several cases in which the type-I error rates of the GC test fall outside the 95% confidence interval. Under discrete population models, the type-I errors of STRAT are within the 95% confidence interval of the nominal type-I error rate when 100 or more independent markers are used. Under admixture models, the type-I error rate of STRAT decreases with the number of independent markers used to control for population stratification. However, when the number of independent markers exceeds 300, the type-I error rates of STRAT decrease very slowly and remain above the upper boundary of the 95% confidence interval even with 500 markers.

*Power Comparisons:* In this set of simulations, we compare the power of the four tests, QualSPT, STRAT, GC and TDT. The results are based on 1000 replications with each replication consisting of *n*= 100 diseased individuals and 100 normal individuals for QualSPT, STRAT and GC, and 2*n*/3 and *n* triads for TDT test. A total of 300 markers are used by QualSPT, STRAT and GC to control for population stratification. For almost all the cases including different population models and different disease models we considered, QualSPT is more powerful than TDT when TDT uses 2*n*/3 triads, and QualSPT is less powerful than TDT when TDT uses *n* triads. The power comparisons of the three tests, QualSPT, STRAT and GC, are more complicated.

The results of our power comparisons under discrete population model and the assumption of the same high risk allele in all the subpopulations are summarized in Table 1. For almost all the cases in this set of simulations, QualSPT is more powerful than STRAT. The power of GC is substantially lower than that of both QualSPT and STRAT, especially when population stratification becomes stronger (*P*_{1}:*P*_{2}:*P*_{3}:*P*_{4}= 1:4:6:8).

Table 1. Power comparisons of the four tests under discrete subpopulation models and the assumption of the same high risk allele in all the subpopulations. The sample size is *n*= 100 diseased and 100 normal individuals for QualSPT, STRAT, and GC. The sample size is 2*n*/3 and *n* triads for TDT. *P*_{i} is the relative disease prevalence of the *i*th subpopulation and *R*_{A_i} is the relative risk of genotype *AA* in the *i*th subpopulation (*i*= 1, 2, 3, 4). Columns are grouped by *P*= 0.05 and *P*= 0.01; within each group, the two TDT columns correspond to 2*n*/3 and *n* triads, respectively.

| *P*_{1}:*P*_{2}:*P*_{3}:*P*_{4} | *R*_{A_1},*R*_{A_2},*R*_{A_3},*R*_{A_4} | Disease Model | QualSPT (0.05) | STRAT (0.05) | GC (0.05) | TDT, 2*n*/3 (0.05) | TDT, *n* (0.05) | QualSPT (0.01) | STRAT (0.01) | GC (0.01) | TDT, 2*n*/3 (0.01) | TDT, *n* (0.01) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1:2:3:4 | 1,2,3,4 | Domi. | 0.42 | 0.33 | 0.21 | 0.34 | 0.48 | 0.24 | 0.15 | 0.06 | 0.15 | 0.27 |
| | | Add. | 0.28 | 0.30 | 0.28 | 0.32 | 0.46 | 0.10 | 0.11 | 0.10 | 0.13 | 0.23 |
| | | Rec. | 0.35 | 0.30 | 0.23 | 0.34 | 0.43 | 0.18 | 0.16 | 0.10 | 0.17 | 0.26 |
| | 2,4,6,8 | Domi. | 0.86 | 0.76 | 0.58 | 0.73 | 0.84 | 0.75 | 0.60 | 0.30 | 0.52 | 0.72 |
| | | Add. | 0.79 | 0.65 | 0.59 | 0.74 | 0.87 | 0.59 | 0.45 | 0.31 | 0.52 | 0.73 |
| | | Rec. | 0.78 | 0.71 | 0.59 | 0.71 | 0.80 | 0.65 | 0.57 | 0.42 | 0.53 | 0.70 |
| 1:4:6:8 | 1,4,6,8 | Domi. | 0.88 | 0.64 | 0.52 | 0.70 | 0.83 | 0.77 | 0.46 | 0.27 | 0.50 | 0.70 |
| | | Add. | 0.77 | 0.70 | 0.47 | 0.72 | 0.87 | 0.61 | 0.53 | 0.24 | 0.52 | 0.75 |
| | | Rec. | 0.80 | 0.73 | 0.51 | 0.72 | 0.81 | 0.68 | 0.60 | 0.34 | 0.56 | 0.69 |
| | 2,8,12,16 | Domi. | 0.97 | 0.80 | 0.78 | 0.84 | 0.90 | 0.94 | 0.65 | 0.61 | 0.71 | 0.84 |
| | | Add. | 0.95 | 0.90 | 0.65 | 0.90 | 0.96 | 0.89 | 0.79 | 0.36 | 0.80 | 0.91 |
| | | Rec. | 0.96 | 0.90 | 0.78 | 0.90 | 0.94 | 0.90 | 0.83 | 0.68 | 0.84 | 0.90 |

Under admixture population model A and the assumption of the same high risk allele in the two ancestral populations, the results of the power comparisons are given in Table 2. Under this population model, population stratification is not strong, and the powers of the three tests, QualSPT, STRAT and GC, are similar.

Table 2. Power comparisons of the four tests under admixture population model A and the assumption of the same high risk allele in the two ancestral populations. The sample size is *n*= 100 diseased and 100 normal individuals for QualSPT, STRAT, and GC. The sample size is 2*n*/3 and *n* triads for TDT. *R*_{A_i} is the relative risk of genotype *AA* in the *i*th ancestral population (*i*= 1, 2). Columns are grouped by *P*= 0.05 and *P*= 0.01; within each group, the two TDT columns correspond to 2*n*/3 and *n* triads, respectively.

| *R*_{A_1},*R*_{A_2} | Disease Model | QualSPT (0.05) | STRAT (0.05) | GC (0.05) | TDT, 2*n*/3 (0.05) | TDT, *n* (0.05) | QualSPT (0.01) | STRAT (0.01) | GC (0.01) | TDT, 2*n*/3 (0.01) | TDT, *n* (0.01) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2,2 | Domi. | 0.66 | 0.67 | 0.63 | 0.57 | 0.72 | 0.44 | 0.43 | 0.40 | 0.36 | 0.51 |
| | Add. | 0.60 | 0.65 | 0.64 | 0.55 | 0.70 | 0.39 | 0.44 | 0.41 | 0.35 | 0.48 |
| | Rec. | 0.58 | 0.60 | 0.61 | 0.52 | 0.63 | 0.34 | 0.39 | 0.37 | 0.33 | 0.44 |
| 4,4 | Domi. | 0.93 | 0.92 | 0.89 | 0.88 | 0.95 | 0.88 | 0.83 | 0.74 | 0.77 | 0.89 |
| | Add. | 0.93 | 0.92 | 0.79 | 0.90 | 0.96 | 0.85 | 0.84 | 0.59 | 0.76 | 0.90 |
| | Rec. | 0.93 | 0.92 | 0.80 | 0.89 | 0.94 | 0.83 | 0.82 | 0.64 | 0.76 | 0.89 |
| 2,4 | Domi. | 0.87 | 0.86 | 0.81 | 0.82 | 0.91 | 0.75 | 0.73 | 0.69 | 0.64 | 0.80 |
| | Add. | 0.87 | 0.89 | 0.83 | 0.82 | 0.90 | 0.74 | 0.75 | 0.71 | 0.67 | 0.81 |
| | Rec. | 0.84 | 0.84 | 0.74 | 0.80 | 0.88 | 0.71 | 0.72 | 0.59 | 0.63 | 0.79 |
| 4,8 | Domi. | 0.96 | 0.94 | 0.95 | 0.94 | 0.97 | 0.94 | 0.90 | 0.90 | 0.87 | 0.94 |
| | Add. | 0.97 | 0.95 | 0.93 | 0.95 | 0.98 | 0.94 | 0.91 | 0.87 | 0.89 | 0.95 |
| | Rec. | 0.96 | 0.95 | 0.94 | 0.94 | 0.98 | 0.92 | 0.92 | 0.91 | 0.88 | 0.95 |

Table 3. Power comparisons of the four tests under admixture population model B and the assumption of the same high risk allele in the two ancestral populations. The sample size is *n*= 100 diseased and 100 normal individuals for QualSPT, STRAT, and GC. The sample size is 2*n*/3 and *n* triads for TDT. *R*_{A_i} is the relative risk of genotype *AA* in the *i*th ancestral population (*i*= 1, 2). α is the parameter of the beta distribution *B*(α, 2) described in the text. Columns are grouped by *P*= 0.05 and *P*= 0.01; within each group, the two TDT columns correspond to 2*n*/3 and *n* triads, respectively.

| α | *R*_{A_1},*R*_{A_2} | Disease Model | QualSPT (0.05) | STRAT (0.05) | GC (0.05) | TDT, 2*n*/3 (0.05) | TDT, *n* (0.05) | QualSPT (0.01) | STRAT (0.01) | GC (0.01) | TDT, 2*n*/3 (0.01) | TDT, *n* (0.01) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 1,2 | Domi. | 0.55 | 0.57 | 0.29 | 0.55 | 0.70 | 0.33 | 0.33 | 0.11 | 0.33 | 0.50 |
| | | Add. | 0.56 | 0.55 | 0.23 | 0.50 | 0.68 | 0.33 | 0.33 | 0.07 | 0.27 | 0.44 |
| | | Rec. | 0.56 | 0.58 | 0.34 | 0.49 | 0.64 | 0.33 | 0.37 | 0.13 | 0.29 | 0.45 |
| | 2,4 | Domi. | 0.94 | 0.89 | 0.71 | 0.87 | 0.95 | 0.85 | 0.76 | 0.45 | 0.74 | 0.89 |
| | | Add. | 0.92 | 0.88 | 0.50 | 0.86 | 0.94 | 0.79 | 0.75 | 0.20 | 0.70 | 0.88 |
| | | Rec. | 0.89 | 0.87 | 0.62 | 0.85 | 0.92 | 0.76 | 0.74 | 0.31 | 0.70 | 0.85 |
| 4 | 1,2 | Domi. | 0.38 | 0.37 | 0.11 | 0.40 | 0.60 | 0.16 | 0.17 | 0.02 | 0.17 | 0.30 |
| | | Add. | 0.30 | 0.31 | 0.10 | 0.37 | 0.54 | 0.14 | 0.14 | 0.02 | 0.16 | 0.32 |
| | | Rec. | 0.28 | 0.30 | 0.10 | 0.32 | 0.48 | 0.14 | 0.15 | 0.03 | 0.17 | 0.30 |
| | 2,4 | Domi. | 0.82 | 0.74 | 0.36 | 0.81 | 0.90 | 0.63 | 0.53 | 0.15 | 0.60 | 0.80 |
| | | Add. | 0.80 | 0.75 | 0.36 | 0.76 | 0.90 | 0.57 | 0.54 | 0.14 | 0.59 | 0.76 |
| | | Rec. | 0.74 | 0.70 | 0.27 | 0.75 | 0.90 | 0.48 | 0.45 | 0.12 | 0.53 | 0.68 |

The results of the power comparisons under the assumption of random high risk alleles are summarized in Table 4. In this set of simulations, STRAT is slightly more powerful than QualSPT, and both STRAT and QualSPT are much more powerful than GC.

Table 4. Power comparisons of the four tests under admixture population model B and the assumption of random high risk alleles in the two ancestral populations. The sample size is *n*= 100 diseased and 100 normal individuals for QualSPT, STRAT, and GC. The sample size is 2*n*/3 and *n* triads for TDT. *R*_{A_i} is the relative risk of genotype *AA* in the *i*th ancestral population (*i*= 1, 2). Here, allele *A* denotes the high risk allele, which may be different in different ancestral populations. Columns are grouped by *P*= 0.05 and *P*= 0.01; within each group, the two TDT columns correspond to 2*n*/3 and *n* triads, respectively.

| *R*_{A_1},*R*_{A_2} | Disease Model | QualSPT (0.05) | STRAT (0.05) | GC (0.05) | TDT, 2*n*/3 (0.05) | TDT, *n* (0.05) | QualSPT (0.01) | STRAT (0.01) | GC (0.01) | TDT, 2*n*/3 (0.01) | TDT, *n* (0.01) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1,2 | Domi. | 0.55 | 0.71 | 0.19 | 0.31 | 0.41 | 0.33 | 0.58 | 0.05 | 0.14 | 0.25 |
| | Add. | 0.56 | 0.51 | 0.10 | 0.30 | 0.41 | 0.33 | 0.33 | 0.04 | 0.13 | 0.21 |
| | Rec. | 0.56 | 0.42 | 0.12 | 0.26 | 0.34 | 0.33 | 0.23 | 0.03 | 0.13 | 0.18 |
| 2,4 | Domi. | 0.94 | 0.74 | 0.30 | 0.40 | 0.60 | 0.85 | 0.63 | 0.14 | 0.23 | 0.41 |
| | Add. | 0.92 | 0.66 | 0.23 | 0.50 | 0.63 | 0.79 | 0.48 | 0.10 | 0.31 | 0.43 |
| | Rec. | 0.89 | 0.89 | 0.33 | 0.59 | 0.69 | 0.76 | 0.76 | 0.16 | 0.39 | 0.52 |