Blindly Using Wald's Test Can Miss Rare Disease-Causal Variants in Case-Control Association Studies

Authors


Chao Xing, Ph.D., McDermott Center of Human Growth and Development, University of Texas Southwestern Medical Center, 5323 Harry Hines Boulevard, Dallas TX 75390, USA. Tel: 214-648-1695; Fax: 214-648-1666; E-mail: chao.xing@utsouthwestern.edu

Summary

There are four tests – the likelihood ratio (LR) test, Wald's test, the score test and the exact test – commonly employed in genetic association studies. On comparison of the four tests, we found that Wald's test, popular in genome-wide screens due to its low computational demands, exhibited a paradoxical behaviour in that the test statistic decreased as the effect size of the variant increased, resulting in a loss of power. The LR test always achieved the most significant P-values, followed by the exact test. We further examined the results in a real data set composed of high- and low-cholesterol subjects from the Dallas Heart Study (DHS). We also compared the single-variant LR test with two multi-variant analysis approaches – the burden test and the C-alpha test – in analysing the sequencing data by simulation. Our results call for caution in using Wald's test in genome-wide case-control association studies and suggest that the LR test is a better alternative in spite of its computational demands.

Introduction

To test for association between genotype and phenotype in a case-control study, one can either employ the exact test or fit a logistic regression model to the data and test whether the coefficient for genotype is zero. There are three asymptotic tests that are commonly employed: the likelihood ratio (LR) test, Wald's test and the score test. The LR test compares the null and alternative hypotheses on an equal basis, whereas Wald's test starts at the alternative and considers movement towards the null, and the score test begins with the null and asks whether movement towards the alternative could be an improvement. The three tests have equivalent asymptotic power for testing local alternatives (Cox & Hinkley, 1974). From a computational standpoint, the LR test is most demanding because it requires both the restricted and unrestricted estimates of parameters, whereas Wald's test uses only the unrestricted estimates and the score test uses only the restricted estimates. Besides hypothesis testing, investigators are also interested in estimating a variant's odds ratio $e^{\hat{\beta}}$, where $\hat{\beta}$ is the unrestricted maximum likelihood estimate (MLE) of the coefficient for genotype. Computer programs often produce $\hat{\beta}$ and its estimated variance $\widehat{\mathrm{Var}}(\hat{\beta})$, which makes it convenient to compute the Wald's test statistic $(\hat{\beta} - \beta_{\mathrm{null}})^2 / \widehat{\mathrm{Var}}(\hat{\beta})$ to test the null hypothesis $\beta = \beta_{\mathrm{null}}$. Thus, Wald's test is often the default option in a genome-wide scan (for example, the --logistic command in PLINK; Purcell et al., 2007), in particular when covariates are present.

However, we notice an anomalous behaviour of Wald's test: if a variant is mainly present in cases or in controls, which means a large effect size under the alternative hypothesis, Wald's test generates a non-significant P-value, whereas the other two tests produce significant P-values. This abnormal phenomenon of Wald's test may have been observed by many researchers, but its theoretical interpretation is less well understood; in a binary logit model, as the distance between the parameter estimate and the null value increases, the test statistic decreases to zero and the power of the test diminishes to the test size (Hauck & Donner, 1977). This aberrant behaviour of Wald's test is particularly pertinent to low-frequency variants. Suppose a causal variant with high penetrance is present at low frequency in the cases and nearly absent from the controls; the power of Wald's test will be minimal even if the effect size estimate of the variant is large. Were Wald's test employed, the causal variant would not show statistically significant association with the disease, and one could miss the association by only screening a list of P-values. Thus, alternative tests – the LR test, the score test and the exact test – should be considered in this situation. In this paper, we compared the four tests in terms of both validity when a variant is at low frequency, and power when a low-frequency variant is highly penetrant.

Methods

Consider a test for association between a low-frequency single-nucleotide polymorphism (SNP) and the disease affection status in a case-control study. Denote by A and a the major and minor alleles of the site, respectively, with the frequency of a sufficiently low that we only observe genotypes AA and Aa, but not aa, in the sample. Data can then be summarised into a 2 × 2 contingency table (Table 1). Denote by $Y_i$ the affection status of individual i, where $Y_i = 1$ or 0 indicates that individual i is a case or a control. Denote by $X_i$ the genotypic value of individual i, where $X_i = 1$ or 0 indicates that individual i is Aa or AA. To test for association between the genotype and phenotype, we fit to the data a logistic regression model $\mathrm{logit}(\pi_i) = \beta_0 + \beta_1 X_i$, where $\pi_i = P(Y_i = 1 \mid X_i)$. The hypotheses to be tested are $H_0: \beta_1 = 0$ versus $H_1: \beta_1 \neq 0$. Denote by l(·) the log-likelihood function, by $\hat{\beta}_1$ the MLE of $\beta_1$, by S(·) the score statistic and by I(·) the information matrix. The LR test statistic is defined as $LRT = 2[l(\hat{\beta}_1) - l(0)]$. Wald's test statistic is defined as $WT = \hat{\beta}_1^2 / \widehat{\mathrm{Var}}(\hat{\beta}_1)$, where $\widehat{\mathrm{Var}}(\hat{\beta}_1)$ is the estimated variance of $\hat{\beta}_1$. The score test statistic is defined as $ST = S(0)^{T} I(0)^{-1} S(0)$, with the score and information evaluated under the null hypothesis. Note that the widely used Armitage's trend test (Armitage, 1955; Sasieni, 1997) and Pearson's $\chi^2$ test (Pearson, 1900) in genetic studies are score tests, and for data in Table 1, both tests lead to the same test statistic $N(r_0 s_1 - r_1 s_0)^2 / (R S n_0 n_1)$ as ST. All three statistics, LRT, WT and ST, follow an asymptotic distribution of $\chi^2_1$, and therefore they should have approximately equivalent power, though a systematic inequality $WT \geq LRT \geq ST$ exists (Berndt & Savin, 1977). Given a sample, when $\beta_1$ is far from 0, we desire adequate power from all three tests to reject the null hypothesis. However, it was shown that $WT \to 0$ as $|\hat{\beta}_1| \to \infty$ (Hauck & Donner, 1977), and therefore the power of Wald's test diminishes to the test level in this situation.
Another commonly used test is Fisher's exact test (Fisher, 1925), which calculates the exact significance level by assuming a hypergeometric distribution for contingency tables. When a sample size is small, Fisher's exact test is often preferred to the other three tests that rely on asymptotic theories to evaluate significance levels.
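
For the single binary genotype of Table 1 the quantities above have well-known closed forms, which may make the anomaly easier to see. The block below is a sketch under the assumption that all four cells are positive (otherwise the MLE is not finite); O and E denote observed and expected cell counts and are notation introduced here, not taken from the paper.

```latex
\begin{align*}
\hat{\beta}_1 &= \log\frac{r_1 s_0}{r_0 s_1}, &
\widehat{\mathrm{Var}}(\hat{\beta}_1) &= \frac{1}{r_0} + \frac{1}{r_1} + \frac{1}{s_0} + \frac{1}{s_1}, \\
WT &= \frac{\hat{\beta}_1^{2}}{\widehat{\mathrm{Var}}(\hat{\beta}_1)}, &
LRT &= 2 \sum_{\text{cells}} O \log\frac{O}{E},
\qquad E = \frac{(\text{row total})(\text{column total})}{N}.
\end{align*}
```

As s1 becomes small with the other cells fixed, the numerator of WT grows only like $(\log s_1)^2$ while the 1/s1 term in the variance grows much faster. For example, with R = S = 2000 and r1 = 20, these expressions give WT ≈ 9.70 at s1 = 2 but WT ≈ 8.59 at s1 = 1, even though the estimated odds ratio roughly doubles.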

Table 1.  Distribution of cases and controls by genotype.

             Genotype
Phenotype    AA     Aa     Total
Case         r0     r1     R
Control      s0     s1     S
Total        n0     n1     N
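
The four tests can be sketched for the layout of Table 1 using only the Python standard library. This is a minimal illustration, not the authors' code: the function names are hypothetical, the Wald statistic uses the closed-form log odds ratio and its variance (so it needs all four cells to be positive), and the `yates` option is an assumption added here because the score-test P-values reported later in Tables 4 and 5 appear numerically closer to the continuity-corrected Pearson statistic, which is an inference from the numbers rather than something the text states.

```python
import math

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

def lr_stat(r1, s1, R, S):
    """Likelihood ratio statistic 2 * sum(O * log(O / E)); empty cells contribute 0."""
    N = R + S
    g = 0.0
    for row_total, carriers in ((R, r1), (S, s1)):
        for obs, col_total in ((carriers, r1 + s1), (row_total - carriers, N - r1 - s1)):
            if obs > 0:
                g += obs * math.log(obs * N / (row_total * col_total))
    return 2.0 * g

def wald_stat(r1, s1, R, S):
    """Wald statistic from the log odds ratio; requires all four cells > 0."""
    r0, s0 = R - r1, S - s1
    if min(r0, r1, s0, s1) == 0:
        raise ValueError("the MLE is not finite when a cell is empty")
    beta = math.log(r1 * s0 / (r0 * s1))
    var = 1 / r0 + 1 / r1 + 1 / s0 + 1 / s1
    return beta ** 2 / var

def score_stat(r1, s1, R, S, yates=False):
    """Pearson chi-square statistic, optionally with Yates' continuity correction."""
    N, n1 = R + S, r1 + s1
    n0 = N - n1
    diff = abs((R - r1) * s1 - r1 * (S - s1))
    if yates:
        diff = max(diff - N / 2.0, 0.0)
    return N * diff ** 2 / (R * S * n0 * n1)

def fisher_exact_p(r1, s1, R, S):
    """Two-sided Fisher P-value: sum of hypergeometric probabilities <= the observed one."""
    N, n1 = R + S, r1 + s1
    denom = math.comb(N, n1)

    def prob(k):
        return math.comb(R, k) * math.comb(S, n1 - k) / denom

    p_obs = prob(r1)
    lo, hi = max(0, n1 - S), min(n1, R)
    # Small relative tolerance so that ties are not lost to rounding.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))

if __name__ == "__main__":
    R = S = 2000
    print("LR    (r1=10, s1=0):", chi2_sf_1df(lr_stat(10, 0, R, S)))      # Table 4 reports 1.94E-04
    print("Wald  (r1=10, s1=1):", chi2_sf_1df(wald_stat(10, 1, R, S)))    # Table 4 reports 2.79E-02
    print("Score (r1=10, s1=0):", chi2_sf_1df(score_stat(10, 0, R, S, yates=True)))
    print("Exact (r1=10, s1=0):", fisher_exact_p(10, 0, R, S))            # Table 4 reports 1.93E-03
```

Note that `wald_stat` deliberately refuses tables with an empty cell: in that case the reported Wald P-value depends on where the fitting algorithm stops iterating rather than on a finite estimate.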

Although all four tests are assumed to maintain proper test sizes in general, their properties are unclear in the case of low-frequency exposure variables. We first compared their type I error rates in testing for association of low-frequency variants by large-scale simulations. We considered a balanced design with an equal number (500, 1000 and 2000) of cases and controls, and a SNP with the minor allele frequency (MAF) equal to 0.005. In each setting, 100,000 replicates, in each of which the SNP was polymorphic, were generated. We performed the four tests on all the datasets. Except in the case of the exact test, we calculated P-values assuming an asymptotic distribution of $\chi^2_1$. The empirical type I error rates at four nominal levels ($\alpha = 0.05, 0.01, 0.001, 0.0001$) were calculated as the proportion of the 100,000 replicates for which the P-value was less than or equal to α.
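
A scaled-down version of this type I error simulation might look as follows. It is a sketch under stated assumptions rather than the authors' procedure: carriers are drawn independently in cases and controls with probability 1 − (1 − MAF)² (Hardy–Weinberg proportions, with the rare aa genotype counted as a carrier), only the LR test is evaluated, and the function names, seed and replicate count are hypothetical.

```python
import math
import random

def lr_pvalue(r1, s1, R, S):
    """LR test P-value (chi-square, 1 df) for the 2 x 2 layout of Table 1."""
    N = R + S
    g = 0.0
    for row_total, carriers in ((R, r1), (S, s1)):
        for obs, col_total in ((carriers, r1 + s1), (row_total - carriers, N - r1 - s1)):
            if obs > 0:
                g += obs * math.log(obs * N / (row_total * col_total))
    return math.erfc(math.sqrt(max(g, 0.0)))  # erfc(sqrt(G / 2)) with G = 2 * g

def type1_error_rate(n_per_group, maf=0.005, alpha=0.05, reps=2000, seed=1):
    """Proportion of polymorphic null replicates with LR P-value <= alpha."""
    rng = random.Random(seed)
    carrier_prob = 1.0 - (1.0 - maf) ** 2   # assumption: Hardy-Weinberg proportions
    hits = kept = 0
    while kept < reps:
        r1 = sum(rng.random() < carrier_prob for _ in range(n_per_group))
        s1 = sum(rng.random() < carrier_prob for _ in range(n_per_group))
        if r1 + s1 == 0:
            continue                        # keep only replicates where the SNP is polymorphic
        kept += 1
        if lr_pvalue(r1, s1, n_per_group, n_per_group) <= alpha:
            hits += 1
    return hits / reps

if __name__ == "__main__":
    for n in (500, 1000, 2000):
        print(n, type1_error_rate(n))
```

With far fewer replicates than the 100,000 used in the paper, the estimates are noisy and only the least stringent nominal level can be assessed meaningfully.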

Second, we compared the significance levels attained by the four tests as the effect size of a low-frequency variant increased. In this paper, we define the effect size as relative risk, i.e. $P(Y_i = 1 \mid X_i = 1) / P(Y_i = 1 \mid X_i = 0)$. Consider a balanced design with 2000 cases and 2000 controls, and a SNP with the MAF at most equal to 0.005 in cases and even less frequent in controls, i.e. r1 ≤ 20 and s1 < r1 in Table 1. Given r1, as s1 decreased, the effect size of allele a increased. We compared the significance levels that the four tests could attain for each combination of r1 and s1.

Third, we considered a special scenario in which the frequency of a highly penetrant disease-causal variant was constrained to be low under selection pressure such that in a sample of a case-control study this variant appeared only scarcely in cases, but not at all in controls, i.e. s1 = 0 and 0 < r1 ≪ r0 in Table 1. We investigated three sample sizes, $R = S \in \{500, 1000, 2000\}$. In each setting, we fixed s1 to be 0, constrained the MAF of the variant in cases to be less than or equal to 0.005, and enumerated all the possibilities, i.e. $r_1 \in \{1, 2, \ldots, 0.01R\}$. We compared the significance levels that the four tests could attain in all of the 35 possible situations.
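
The enumeration of the 35 settings can be sketched as below, here paired with the LR test only. The variable and function names are hypothetical, and the bound r1 ≤ 0.01R follows from requiring r1/(2R) ≤ 0.005 when every carrier is a heterozygote.

```python
import math

def lr_pvalue(r1, s1, R, S):
    """LR test P-value (chi-square, 1 df) for the 2 x 2 layout of Table 1."""
    N = R + S
    g = 0.0
    for row_total, carriers in ((R, r1), (S, s1)):
        for obs, col_total in ((carriers, r1 + s1), (row_total - carriers, N - r1 - s1)):
            if obs > 0:
                g += obs * math.log(obs * N / (row_total * col_total))
    return math.erfc(math.sqrt(max(g, 0.0)))  # erfc(sqrt(G / 2)) with G = 2 * g

# All (sample size, carrier count) pairs with s1 = 0 and MAF in cases <= 0.005.
settings = [(R, r1) for R in (500, 1000, 2000) for r1 in range(1, R // 100 + 1)]

if __name__ == "__main__":
    print(len(settings), "settings")
    for R, r1 in settings:
        print(f"R=S={R:4d}  r1={r1:2d}  LR P-value={lr_pvalue(r1, 0, R, R):.2e}")
```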

In a genetic study, it is often necessary to adjust for covariates such as known risk factors and confounding factors (Xing & Xing, 2010). Therefore, besides the genetic variant, we also simulated a binary covariate mimicking an environmental factor independent of the genetic variant with the exposure rates of 0.3 and 0.2 in cases and controls, respectively. Wald's test, the LR test and the score test are readily applicable to multiple logistic regression models. As to the counterpart of Fisher's exact test, we performed the conditional exact inference by enumerating the exact distributions of sufficient statistics for the parameter of interest conditional on the remaining parameters in a logistic regression model (Hirji et al., 1987; Cox & Snell, 1989), as implemented in the SAS procedure LOGISTIC (Derr, 2009).

Fourth, as a proof-of-principle example, we performed Wald's test, the LR test and the score test in a genome-wide association (GWA) study of plasma levels of low-density lipoprotein (LDL) cholesterol with nonsynonymous SNPs in the African Americans (AAs) of the Dallas Heart Study (DHS), and compared their performance in testing for low-frequency variants present only in the upper quintile of the population. The DHS is a multiethnic, population-based cohort in Dallas County (Victor et al., 2004). In this study, we focused on the AAs. There were 1722 individuals with complete phenotypes (LDL, body mass index (BMI), age and sex) after deleting those taking cholesterol-lowering medicine. We sampled the upper and lower quintiles (N = 345) as cases and controls, respectively. There were a total of 8968 nonsynonymous SNPs assayed across the autosomes. We first filtered out singletons and those out of Hardy–Weinberg equilibrium (P-value < $1.0 \times 10^{-5}$) in the whole population, then filtered out monomorphic ones in the case-control sample, resulting in 8263 SNPs for further analysis. The data were fit by a logistic regression model for $\mathrm{logit}(\pi)$ that included the genotypic value G, with coefficient $\beta_1$, and covariates including ancestry, where π was the probability of being affected, G was the genotypic value coded in an additive genetic model with 0, 1 and 2 denoting major allele homozygote, heterozygote and minor allele homozygote, respectively, and the ancestry of each individual was inferred using ancestry-informative markers as described elsewhere (Romeo et al., 2008). We tested whether $\beta_1 = 0$ by Wald's test, the LR test and the score test. The exact test was not performed because of memory constraints when multiple covariates were present.

Fifth, we evaluated the performance of the single-variant LR test in analyzing sequencing data by simulation studies based on the results of a pooled sequencing study. Neale et al. (2011, see their table 1) reported results of pooled sequencing of the ApoB gene in 96 individuals with high triglyceride levels exceeding the 5% upper tail of the population distribution and 96 individuals with low triglyceride levels below the 5% lower tail. There were a total of 27 nonsynonymous variants detected, among which 10 were singletons with 6 in the upper tail and 4 in the lower tail. The distribution of singletons between the two tails was relatively balanced, which contributed little information in testing association between the gene and phenotype; therefore, we simulated data based on the distribution of the remaining 17 variants. Five sample sizes (n × 96 cases and n × 96 controls, n ∈ {1, 2, 3, 4, 5}) were considered. For each variant, we fixed the total number of counts as n times the number observed in the original data and decided its distribution between cases and controls by a binomial trial. The binomial parameter p equalled 0.5 if the variant was designated as neutral; otherwise it equalled its MLE in the observed data, and the MLE was set to 0.005 (or 0.995) when a variant was only observed in one tail. Two types of simulation schemes were carried out (Table 2). In the first type (A1–A3), a mixture of risk, protective and neutral variants was simulated. In particular, in scheme A1 all 17 variants were simulated at p = MLE. In scheme A2 there were nine variants simulated at p = MLE and the others at p = 0.5; the variant with the largest effect size was N1914S, whose occurrence counts ratio (CR) between the upper and lower tails was 0:5n. In scheme A3, P1143S replaced N1914S as the variant with the largest effect size (CR = 0:6n). In the second type (B1–B3), in addition to neutral variants, only variants acting in the same direction were simulated. The difference among schemes was the variant with the largest effect size: CR = 0:3n, 0:5n and 0:6n for B1, B2 and B3, respectively. In each scheme 1000 replicates were generated and analyzed by the single-variant LR test, a version of burden tests (Morris & Zeggini, 2010) and the C-alpha test (Neale et al., 2011). For both the burden test and the C-alpha test, a single P-value for the gene could be obtained; for the LR test, the minimum of the 17 single-variant Bonferroni-corrected P-values was designated as the P-value for the gene. The empirical power at α level was calculated as the proportion of the 1000 replicates for which the P-value was less than or equal to α.
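
The single-variant part of this simulation can be sketched as follows for a scheme in which every variant is simulated at its MLE (as in A1), with an optional set of variants treated as neutral. Several things here are assumptions made for illustration: counts are treated as numbers of carriers among n × 96 individuals per tail, p is read as the probability that a copy falls in the upper tail, and the function names, seeds and replicate counts are hypothetical. The burden and C-alpha tests are not implemented.

```python
import math
import random

# (upper-tail count, lower-tail count) for the 17 non-singleton variants of Table 2
COUNTS = {
    "A4481T": (2, 5), "I4314V": (3, 0), "R4270T": (6, 3), "V4128M": (1, 7),
    "T3388K": (2, 1), "S3203Y": (6, 0), "L2404I": (2, 3), "E2391D": (2, 2),
    "T2373N": (2, 2), "V2313I": (2, 1), "H1923R": (6, 12), "N1914S": (0, 5),
    "D1871N": (2, 0), "P1143S": (0, 6), "R1128H": (0, 3), "D1113H": (1, 3),
    "T498N": (2, 0),
}

def lr_pvalue(r1, s1, R, S):
    """LR test P-value (chi-square, 1 df) for a 2 x 2 table laid out as Table 1."""
    N = R + S
    g = 0.0
    for row_total, carriers in ((R, r1), (S, s1)):
        for obs, col_total in ((carriers, r1 + s1), (row_total - carriers, N - r1 - s1)):
            if obs > 0:
                g += obs * math.log(obs * N / (row_total * col_total))
    return math.erfc(math.sqrt(max(g, 0.0)))

def binomial_p(upper, lower):
    """MLE of the upper-tail proportion, set to 0.005 / 0.995 for one-tail variants."""
    return min(max(upper / (upper + lower), 0.005), 0.995)

def gene_pvalue(n, rng, neutral=()):
    """Minimum Bonferroni-corrected single-variant LR P-value for one replicate."""
    group = 96 * n
    pvals = []
    for name, (upper, lower) in COUNTS.items():
        total = n * (upper + lower)                      # total count is fixed
        p = 0.5 if name in neutral else binomial_p(upper, lower)
        sim_upper = sum(rng.random() < p for _ in range(total))
        pvals.append(lr_pvalue(sim_upper, total - sim_upper, group, group))
    return min(1.0, len(pvals) * min(pvals))

def empirical_power(n, alpha=1e-4, reps=1000, seed=0, neutral=()):
    """Proportion of replicates whose gene-level P-value is <= alpha."""
    rng = random.Random(seed)
    return sum(gene_pvalue(n, rng, neutral) <= alpha for _ in range(reps)) / reps

if __name__ == "__main__":
    for n in range(1, 6):
        print(n * 96, empirical_power(n))
```

Under the carrier-count assumption, `lr_pvalue(6, 0, 96, 96)` gives about 3.53 × 10^-3, the single-variant value quoted in the Discussion for a variant with six copies in one tail and none in the other.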

Table 2.  Simulation schemes based on pooled sequencing of ApoB in the upper and lower 5% tails of the distribution of triglyceride levels (a).

          Number of counts          Simulation scheme (b)
Variant   Upper tail  Lower tail    A1  A2  A3  B1  B2  B3
A4481T    2           5             √   √   √   √   √   √
I4314V    3           0             √   √   √   ×   ×   ×
R4270T    6           3             √   √   √   ×   ×   ×
V4128M    1           7             √   ×   ×   ×   ×   ×
T3388K    2           1             √   ×   ×   ×   ×   ×
S3203Y    6           0             √   ×   ×   ×   ×   ×
L2404I    2           3             √   ×   ×   ×   ×   ×
E2391D    2           2             √   ×   ×   ×   ×   ×
T2373N    2           2             √   ×   ×   ×   ×   ×
V2313I    2           1             √   ×   ×   ×   ×   ×
H1923R    6           12            √   √   √   √   √   √
N1914S    0           5             √   √   ×   ×   √   ×
D1871N    2           0             √   √   √   ×   ×   ×
P1143S    0           6             √   ×   √   ×   ×   √
R1128H    0           3             √   √   √   √   ×   ×
D1113H    1           3             √   √   √   √   √   √
T498N     2           0             √   √   √   ×   ×   ×

(a) The original result was reported in Table 1 of Neale et al. (2011), in which there were 96 individuals sequenced in each tail. In the simulation study, five sample sizes (n × 96 cases and n × 96 controls, n ∈ {1, 2, 3, 4, 5}) were considered.
(b) For each variant the total number of counts was fixed at n times the number observed in the original data. Symbols √ and × denote that a variant was simulated with the binomial parameter being the MLE of the observed data and a half, respectively.

Results

The empirical type I error rates of the four tests in case of low-frequency variants are summarised in Table 3. The LR test showed a slightly supranominal test size when the sample size was 500; as the sample size increased to 2000, it maintained a proper test size. In contrast, the other three tests all showed an infranominal test size even when the sample size was as large as 2000. The score test and the exact test had similar type I error rates, and the former was more conservative when there was no covariate, but the latter was more conservative when there was a covariate. Compared with these two tests, Wald's test became more and more conservative when the nominal level turned more stringent.

Table 3.  Empirical type I error rates (a) of Wald's test, likelihood ratio (LR) test, score test and exact test for a variant with the minor allele frequency (MAF) equal to 0.005.

                        Without covariates                           With a covariate (b)
Sample size  Test size  Wald's test  LR test  Score test  Exact test  Wald's test  LR test  Score test  Exact test
500          0.05       0.019        0.070    0.021       0.022       0.018        0.059    0.048       0.030
500          0.01       0.000        0.015    0.002       0.003       0.000        0.014    0.007       0.005
500          0.001      0.0000       0.0022   0.0001      0.0003      0.0000       0.0016   0.0003      0.0004
500          0.0001     0.00000      0.00020  0.00000     0.00000     0.00000      0.00020  0.00000     0.00001
1000         0.05       0.040        0.052    0.025       0.028       0.040        0.054    0.050       0.038
1000         0.01       0.003        0.012    0.004       0.005       0.004        0.011    0.009       0.007
1000         0.001      0.0000       0.0015   0.0003      0.0006      0.0000       0.0012   0.0006      0.0006
1000         0.0001     0.00000      0.00018  0.00001     0.00005     0.00000      0.00019  0.00001     0.00002
2000         0.05       0.046        0.052    0.034       0.035       0.045        0.051    0.049       0.041
2000         0.01       0.008        0.011    0.006       0.006       0.007        0.011    0.010       0.008
2000         0.001      0.0003       0.0011   0.0004      0.0006      0.0004       0.0011   0.0009      0.0008
2000         0.0001     0.00000      0.00011  0.00004     0.00004     0.00000      0.00014  0.00007     0.0006

(a) Based on 100,000 replicates, in each of which the variant was polymorphic and the number of cases/controls equalled that in the first column.
(b) The covariate was binary with exposure rates of 0.3 and 0.2 in cases and controls, respectively.

As the effect size of a low-frequency variant increased, Wald's test showed an aberrant behaviour of attaining less significant levels (Table 4). Given r1, the effect size of allele a increased as s1 decreased, and we would expect a test to attain more significant levels. However, when r1 < 15, the P-values for Wald's test increased as s1 decreased from 1 to 0; when r1 ≥ 15, the P-values increased as s1 decreased from 2 to 1 and then to 0. In particular, when s1 = 0, the effect size of allele a approached infinity, and Wald's test led to P-values ≥ 0.9. This anomalous behaviour of Wald's test can be visualised in Figure 1. Given r1 = 20, when s1 = 19, allele a had no detectable effect on the trait. As s1 decreased, the effect size of allele a increased, as did its estimate $\hat{\beta}_1$ and the standard error of $\hat{\beta}_1$. WT also increased and reached its maximum when s1 = 2. As s1 further decreased, the increase of $\hat{\beta}_1$ was slower than that of its standard error, and WT started decreasing. In the extreme case of s1 = 0, WT approached zero. In the extreme situations where a variant appeared only scarcely in cases but not at all in controls, Wald's test always remained at levels ≥ 0.9, regardless of the increase in the effect size of allele a (i.e. the number of cases with genotype Aa for a given sample size) or the increase in sample size for a given effect size (Table 5).

Table 4.  Comparison of significance levels (a) attained by Wald's test, likelihood ratio (LR) test, score test and exact test as the effect size of a rare variant increases (b).

                                    Without covariates                                  With a covariate (c)
No. cases  No. controls  Wald's test (d)  LR test   Score test  Exact test  Wald's test (d)  LR test   Score test  Exact test
with Aa    with Aa
10         0             9.36E-01         1.94E-04  4.38E-03    1.93E-03    9.35E-01         2.24E-04  1.76E-03    1.49E-03
10         1             2.79E-02         3.42E-03  1.57E-02    1.16E-02    2.87E-02         3.85E-03  7.34E-03    9.65E-03
10         2             3.74E-02         1.57E-02  4.30E-02    3.83E-02    3.88E-02         1.70E-02  2.23E-02    3.12E-02
10         3             6.67E-02         4.58E-02  9.56E-02    9.18E-02    6.89E-02         4.81E-02  5.42E-02    7.24E-02
10         4             1.20E-01         1.03E-01  1.81E-01    1.79E-01    1.25E-01         1.07E-01  1.13E-01    1.15E-01
10         5             2.05E-01         1.92E-01  3.01E-01    3.01E-01    2.10E-01         1.98E-01  2.02E-01    2.10E-01
12         0             9.30E-01         4.44E-05  1.47E-03    4.80E-04    9.35E-01         5.36E-05  6.20E-04    3.85E-04
12         1             1.67E-02         9.11E-04  5.47E-03    3.37E-03    1.72E-02         1.06E-03  2.59E-03    2.67E-03
12         2             1.87E-02         4.81E-03  1.60E-02    1.28E-02    1.98E-02         5.45E-03  8.35E-03    1.06E-02
12         3             3.13E-02         1.60E-02  3.85E-02    3.48E-02    3.25E-02         1.71E-02  2.12E-02    2.92E-02
12         4             5.63E-02         4.04E-02  7.95E-02    7.62E-02    5.91E-02         4.30E-02  4.79E-02    6.18E-02
12         5             9.90E-02         8.42E-02  1.45E-01    1.43E-01    1.03E-01         8.78E-02  9.25E-02    1.33E-01
15         0             9.49E-01         4.96E-06  2.93E-04    5.94E-05    9.48E-01         6.13E-06  1.27E-04    4.93E-05
15         1             8.43E-03         1.23E-04  1.13E-03    5.07E-04    8.56E-03         1.43E-04  5.21E-04    4.17E-04
15         2             7.30E-03         7.77E-04  3.54E-03    2.30E-03    7.64E-03         9.05E-04  1.82E-03    1.89E-03
15         3             1.07E-02         3.06E-03  9.36E-03    7.41E-03    1.12E-02         3.41E-03  5.07E-03    6.11E-03
15         4             1.85E-02         9.05E-03  2.15E-02    1.89E-02    1.95E-02         9.91E-03  1.25E-02    1.29E-02
15         5             3.27E-02         2.19E-02  4.36E-02    4.09E-02    3.55E-02         2.43E-02  2.76E-02    3.27E-02
18         0             9.44E-01         5.63E-07  5.92E-05    7.34E-06    9.44E-01         7.08E-07  2.61E-05    5.95E-06
18         1             4.80E-03         1.63E-05  2.34E-04    7.37E-05    4.91E-03         2.01E-05  1.12E-04    6.39E-05
18         2             3.11E-03         1.20E-04  7.72E-04    3.90E-04    3.28E-03         1.46E-04  4.02E-04    3.40E-04
18         3             3.94E-03         5.48E-04  2.19E-03    1.45E-03    4.16E-03         6.35E-04  1.18E-03    1.22E-03
18         4             6.34E-03         1.86E-03  5.45E-03    4.24E-03    6.79E-03         2.12E-03  3.14E-03    3.12E-03
18         5             1.10E-02         5.10E-03  1.21E-02    1.04E-02    1.19E-02         5.79E-03  7.40E-03    7.68E-03
20         0             9.41E-01         1.33E-07  2.05E-05    1.82E-06    9.40E-01         1.77E-07  2.59E-05    1.60E-06
20         1             3.37E-03         4.22E-06  8.21E-05    2.01E-05    3.49E-03         5.62E-06  4.22E-05    1.77E-05
20         2             1.84E-03         3.41E-05  2.79E-04    1.17E-04    1.93E-03         4.23E-05  1.46E-04    1.01E-04
20         3             2.08E-03         1.69E-04  8.20E-04    4.72E-04    2.24E-03         2.02E-04  4.46E-04    4.16E-04
20         4             3.20E-03         6.22E-04  2.13E-03    1.50E-03    3.42E-03         7.26E-04  1.22E-03    9.50E-04
20         5             5.38E-03         1.85E-03  4.97E-03    3.97E-03    5.77E-03         2.09E-03  2.94E-03    2.73E-03

(a) Except for the exact test, P-values were calculated under the asymptotic distribution of $\chi^2_1$; when there was a covariate, the mean P-values of 1000 replicates were reported.
(b) The structure of the data was as in Table 1 with R = S = 2000 and the numbers in the first and second columns corresponding to r1 and s1, respectively. The maximal minor allele frequency (MAF) was 0.005.
(c) The covariate was binary with exposure rates of 0.3 and 0.2 in cases and controls, respectively.
(d) The aberrant P-values by Wald's test are those that increase as s1 decreases: the rows with s1 = 0 for every r1, and also the rows with s1 = 1 when r1 ≥ 15.
Figure 1.

Maximum likelihood estimate (MLE) of coefficient, its estimated standard deviation and Wald's test statistic in a logistic regression model.
(A): There was no additional covariate. (B): A binary covariate with exposure rates of 0.3 and 0.2 in cases and controls, respectively, was generated in 1000 replicates, and the mean estimates were plotted. The sample size was 2000 cases and 2000 controls. The number of cases with genotype Aa was fixed to be 20. The number of controls with genotype Aa decreased by 1 from 19 to 0, corresponding to the 20 points from left to right along the x-axis. Note that when the number of controls with genotype Aa equalled 0, the MLE of the coefficient was infinite. In the figure, we substituted it with a value that met the convergence tolerance criterion ($10^{-8}$) by Fisher's scoring algorithm as implemented in R (R Development Core Team, 2011).
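
The shape of the curves in panel (A) can be approximated with the closed-form log odds ratio and its standard error. This is a sketch rather than the fitting procedure behind the figure: it stops at s1 = 1 because the MLE is infinite at s1 = 0, where the plotted value depends on the convergence criterion, and the function name is hypothetical.

```python
import math

def wald_components(r1, s1, R, S):
    """Return (estimate, standard error, Wald statistic) for the layout of Table 1."""
    r0, s0 = R - r1, S - s1
    beta = math.log(r1 * s0 / (r0 * s1))
    se = math.sqrt(1 / r0 + 1 / r1 + 1 / s0 + 1 / s1)
    return beta, se, (beta / se) ** 2

R = S = 2000
r1 = 20
# s1 runs from 19 down to 1, matching the left-to-right order of the x-axis.
rows = [(s1, *wald_components(r1, s1, R, S)) for s1 in range(19, 0, -1)]
peak_s1 = max(rows, key=lambda row: row[3])[0]

if __name__ == "__main__":
    for s1, beta, se, wt in rows:
        print(f"s1={s1:2d}  beta={beta:6.3f}  se={se:5.3f}  WT={wt:6.3f}")
    print("WT is largest at s1 =", peak_s1)
```

The estimate keeps growing as s1 falls, but between s1 = 2 and s1 = 1 the standard error grows faster, so the statistic turns downward, in line with the maximum at s1 = 2 described in the text.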

Table 5.  Comparison of significance levels (a) attained by Wald's test, likelihood ratio (LR) test, score test and exact test when rare variants appear only in cases (b).

                         Without covariates                              With a covariate (c)
Sample size  No. cases   Wald's test  LR test   Score test  Exact test  Wald's test  LR test   Score test  Exact test
             with Aa
500          1           9.69E-01     2.39E-01  1.00E+00    1.00E+00    9.69E-01     2.47E-01  3.26E-01    6.34E-01
500          3           9.65E-01     4.12E-02  2.48E-01    2.49E-01    9.68E-01     4.37E-02  8.71E-02    1.99E-01
500          5           9.71E-01     8.35E-03  7.29E-02    6.19E-02    9.70E-01     8.91E-03  4.27E-02    2.64E-02
1000         1           9.69E-01     2.39E-01  1.00E+00    1.00E+00    9.69E-01     2.45E-01  3.23E-01    6.23E-01
1000         5           9.55E-01     8.41E-03  7.33E-02    6.22E-02    9.54E-01     9.15E-03  4.35E-02    2.71E-02
1000         10          9.58E-01     1.91E-04  4.33E-03    1.91E-03    9.58E-01     2.27E-04  1.78E-03    1.53E-03
2000         1           9.53E-01     2.39E-01  1.00E+00    1.00E+00    9.53E-01     2.47E-01  3.26E-01    6.35E-01
2000         5           9.55E-01     8.44E-03  7.35E-02    6.23E-02    9.54E-01     9.08E-03  2.70E-02    4.32E-02
2000         10          9.36E-01     1.94E-04  4.38E-03    1.93E-03    9.35E-01     2.24E-04  1.76E-03    1.49E-03
2000         15          9.49E-01     4.96E-06  2.93E-04    5.94E-05    9.48E-01     6.13E-06  1.27E-04    4.93E-05
2000         20          9.41E-01     1.33E-07  2.05E-05    1.82E-06    9.40E-01     1.77E-07  2.59E-05    1.60E-06

(a) Except for the exact test, P-values were calculated under the asymptotic distribution of $\chi^2_1$; when there was a covariate, the mean P-values of 1000 replicates were reported.
(b) The structure of the table is as in Table 1, with s1 = 0, $0 < r_1 \ll r_0$ and the numbers in the first column corresponding to R, and also S.
(c) The covariate was binary with exposure rates of 0.3 and 0.2 in cases and controls, respectively.

In contrast to Wald's test, the LR test, the score test and the exact test all attained more significant levels as the effect size of allele a increased (Tables 4 and 5). Among the four tests, the LR test always outperformed the others. Without covariates, the exact test produced P-values slightly smaller than those of the score test; but this trend was not obvious when a covariate was included, possibly because of random noise in the simulation.

In the DHS there were 75 variants present in the cases but not in the controls, out of which two variants appeared eight times, six variants appeared five times, four variants appeared four times, 14 variants appeared three times, 21 variants appeared twice and the remaining 28 variants appeared only once. All of the 12 SNPs manifesting four times or more in the cases attained a nominal P-value less than 0.05 by the LR test, and we listed their P-values along with the ranks for all three tests in Table 6. By the LR test, two SNPs ranked in the top 20 observed P-values, four ranked in the top 50 and seven ranked in the top 100 in the genome scan; by the score test two ranked in the top 100; in contrast, by Wald's test all 12 SNPs remained at levels ≥ 0.9.

Table 6.  Significant nonsynonymous variants present only in cases at a nominal level of 0.05 by the likelihood ratio (LR) test in a case-control study of low-density lipoprotein (LDL) cholesterol in the Dallas Heart Study (DHS) African Americans (a).

                                         Wald's test          LR test              Score test
SNP identification (b)  Count (c)  MAF (d)  P-value   Rank (e)  P-value   Rank (e)  P-value   Rank (e)
p4759986                8          0.0043   9.75E-01  7920      1.37E-03  11        7.35E-03  58
p1694771                8          0.0042   9.76E-01  7931      1.40E-03  12        7.48E-03  59
p3419794                5          0.0033   9.79E-01  8007      4.36E-03  34        1.35E-02  111
p4779548                5          0.0020   9.70E-01  7831      5.92E-03  49        1.78E-02  160
p4222011                5          0.0054   9.80E-01  8031      9.00E-03  81        2.65E-02  248
p4652837                5          0.0053   9.78E-01  7980      1.14E-02  98        4.23E-02  374
p4004951                5          0.0051   9.71E-01  7845      1.39E-02  130       4.13E-02  362
p4776211                5          0.0027   9.71E-01  7858      1.82E-02  184       5.27E-02  468
p4129917                4          0.0038   9.72E-01  7878      1.07E-02  94        2.58E-02  238
p4752843                4          0.0031   9.73E-01  7902      1.26E-02  118       3.00E-02  281
p1629497                4          0.0039   9.73E-01  7901      2.74E-02  279       6.70E-02  584
p4229120                4          0.0031   9.73E-01  7904      3.31E-02  323       7.89E-02  678

(a) The upper and lower quintiles of the DHS African American population by LDL were treated as cases and controls (N = 345), respectively.
(b) Perlegen identification.
(c) Copies of variants in cases. All carriers were heterozygotes except for one minor allele homozygote for p4652837.
(d) Minor allele frequency (MAF) in the DHS African American population.
(e) Out of 8263 variants in ascending order.

We compared the power of the burden test, C-alpha test and LR test in analyzing sequencing data under different sample sizes at an arbitrary level of 0.0001 (Fig. 2), which was chosen such that the power of the tests across all simulation situations spread from 0 to 1. In scheme A1, data were simulated to mimic the results of the original sequencing study, in which both risk and protective variants were present. At the original sample size, the power of the C-alpha test was 0.343, whereas that of the LR test and burden test was close to 0; when the sample size was doubled, the power of the C-alpha test rapidly increased to 1, whereas that of the other two tests only had a mild increase; when the sample size was tripled, the power of the LR test increased to 0.997, whereas that of the burden test was still less than 0.1. Although both risk and protective variants were also present in schemes A2 and A3, the distribution of variants was less dispersed than that of scheme A1, which weakened the power of the C-alpha test at certain sample size ranges. At the original sample size, the power of all three tests was close to 0; when the sample size was doubled, the power of the C-alpha test increased to ∼0.6, whereas that of the other two tests was nearly unchanged; when the sample size was tripled, the power of the C-alpha test was close to 1 and that of the LR test rapidly increased to ∼0.9, whereas the power of the burden test was still less than 0.1.

Figure 2.

Empirical power comparison of the burden test, C-alpha test and single-variant likelihood ratio (LR) test in analyzing sequencing data. x-axis denotes the sample size at each tail, and y-axis denotes the empirical power at an arbitrary level of 0.0001, which was calculated as the proportion of the 1000 replicates for which the P-value was less than or equal to 0.0001. For both the burden test and the C-alpha test, a single P-value for the gene was obtained; for the LR test, the minimum of the 17 single-variant Bonferroni-corrected P-values was designated as the P-value for the gene.

In schemes B1–B3 multiple variants acted in the same direction. The burden test was more powerful than in schemes A1–A3, and it was more powerful than the other two tests when the sample size was small. With the increase of sample size, both the C-alpha test and the LR test outperformed the burden test. The C-alpha test suffered power loss when the effects were less dispersed compared to those in schemes A1–A3. The power of all three tests increased as the effect sizes of variants increased from scheme B1 to B2 and then to B3; however, unlike the power of the other two tests that depends on the collective effects of multiple variants, for a given sample size, the power of the LR test could dramatically increase when the largest effect size of variants increased to a certain level.

Discussion

A conventional recommendation in testing small-sample categorical data, in particular when the smallest expected count in a cell is less than five (Cochran, 1952), is to use the exact test instead of the score test. However, both tests are conservative (Campbell, 2007). Small observed counts have more impact on the LR test statistic than on the score test statistic, such that the latter is recommended in small-sample datasets because the former has an inflated type I error rate (Larntz, 1978). In this paper, we consider a situation of small expected/observed counts in a large-sample dataset, which is the case for low-frequency variants in genetic studies. We find that both the score test and the exact test are still conservative, whereas the LR test statistic maintains a proper size. Wald's test is very conservative at stringent test levels. Therefore, regardless of the aberrant behaviour of Wald's test under the extreme alternative hypothesis that we report in this paper, the LR test is preferable in large-scale genetic studies of low-frequency variants.

Our results suggest that blindly using Wald's test in association studies carries unrecognised risks of failing to identify low-frequency disease-causal variants. This issue concerns both GWA and sequencing data. The power of a genetic test depends on the sample size, a variant's MAF, its effect size and its degree of linkage disequilibrium with the disease-causal variation (Chapman et al., 2003). The conventional wisdom in analyzing sequencing data, which are mainly composed of low-frequency variants, is that a single-variant analysis has insufficient power due to low MAFs and that one has to aggregate multiple rare variants to detect significant associations. Implicit assumptions behind this rationale include: the sample size is limited, a variant is extremely rare, or its effect size is moderate to small. However, in the case of low-frequency variants with large effect sizes, the single-variant analysis can still achieve genome-wide significance, in particular when the disease-causal variant, either chip-genotyped or in silico-imputed, is examined. For example, in three recently published Icelandic population-based whole-genome sequencing/imputation studies, a rare missense variant c.2161C>T (MAF = 0.0038) in MYH6 was found to be associated with sick sinus syndrome (OR = 12.53, P-value = $1.5 \times 10^{-29}$) (Holm et al., 2011); a rare frameshift mutation c.2040_2041insTT (MAF = 0.0041) in BRIP1 was found to be associated with ovarian cancer (OR = 8.13, P-value = $2.8 \times 10^{-14}$) (Rafnar et al., 2011); and a low-frequency missense variant c.1580C>G (MAF = 0.019) in ALDH16A1 was found to be associated with gout (OR = 3.12, P-value = $1.5 \times 10^{-16}$) and serum uric acid levels (P-value = $4.5 \times 10^{-21}$) (Sulem et al., 2011). Note that all three of these novel variants were identified by the single-variant analysis.

Given a small sample size, the power to detect association for a single low-frequency variant is limited even when the LR test is employed. However, with large consortia set up nowadays, true association, in particular for those disease-causal variants with large effect sizes, is still likely to be distinctive. For example, the ApoB sequencing study on the 192 individuals, referred to in the Methods, was based on the cardiovascular cohort of the Malmö Diet and Cancer Study consisting of ∼6000 individuals (Kathiresan et al., 2008). There were two variants (S3203Y and P1143S) having six copies in one tail but not in the other, and the single-variant LR test produced a P-value of $3.53 \times 10^{-3}$. Suppose they were causal variants with large effect sizes and their expressivity was stable; then a joint analysis of three, four and five similar studies would generate Fisher's combined P-values (Fisher, 1925) of $7.1 \times 10^{-6}$, $3.4 \times 10^{-7}$ and $1.7 \times 10^{-8}$, respectively. Thus, even without a single cohort as large as the Icelandic studies, genome-wide significance can still be achieved for low-frequency disease-causal variants by the consortia already established. Note, however, that such signals could be missed if Wald's test were employed.
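
The combined P-values quoted above can be checked with a short calculation. The sketch below applies Fisher's method to k identical single-study P-values, using the closed-form chi-square tail for an even number of degrees of freedom so that no external library is needed; the function name is hypothetical.

```python
import math

def fisher_combined_p(pvalues):
    """Fisher's method: -2 * sum(log p) is chi-square with 2k df under the null."""
    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)   # half of the test statistic
    # Upper tail for even df 2k: exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    term = total = 1.0
    for j in range(1, k):
        term *= half_stat / j
        total += term
    return math.exp(-half_stat) * total

if __name__ == "__main__":
    for k in (3, 4, 5):
        print(k, f"{fisher_combined_p([3.53e-3] * k):.1e}")
```

With a single P-value the function returns that P-value unchanged, which is a convenient sanity check on the tail formula.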

Based on the data from a pooled sequencing study, we performed proof-of-principle simulations to demonstrate the characteristics of two multi-variant analysis approaches – the burden test and the C-alpha test – and the single-variant LR test. The simulation settings were chosen not with the intention of undermining the power of multi-variant analysis approaches, but for the purpose of characterizing each method. The power of the burden test is sensitive to the mean effects of multiple variants, whereas the power of the C-alpha test is sensitive to the dispersion of the effects; in contrast, the single-variant LR test is robust to the distribution of multiple variants' effects, but is sensitive to the largest effect size among the variants. Simulation studies (Neale et al., 2011; Wu et al., 2011) had shown that multi-variant analysis of sequencing data required a sample size of at least thousands to attain the genome-wide significance level ($10^{-6}$) and that the power was dependent on the distribution of effects of multiple variants. This might be the reason why no novel gene/locus has yet been reported for complex traits as a result of analysing sequencing data using the multi-variant analysis approaches. Even for the extremely large gene ApoB, which is composed of 4536 amino-acid residues and is known to be associated with lipid levels, selective genotyping of 192 individuals with extreme phenotypes from a cohort of ∼6000 and analysis using the C-alpha method only generated a mean P-value at the level of $10^{-3}$ based on our simulation of 1000 replicates. We speculate that, similar to the GWA studies, analysis of a large sample size, made possible by the formation of large consortia, will still be the key to success in the sequencing era.
Meanwhile, as long as there is a variant with a large effect size and stable expressivity, the single-variant analysis approach has the potential to identify it, as demonstrated by the Icelandic studies (Holm et al., 2011; Rafnar et al., 2011; Sulem et al., 2011).

In summary, in this study, we have shown that the statistical weaknesses of Wald's test are not merely a side note, but are likely to be a significant issue in many realistic situations. Our results support the use of alternative approaches, particularly LR tests, which are not susceptible to such problems, especially as ongoing advances in computational capabilities continue to reduce obstacles to intensive analyses.

Acknowledgements

We thank the reviewers for constructive comments which tremendously improved the manuscript, as well as Dr. Nathan Morris for critically reading, commenting and editing an early version of the manuscript. We also thank Dr. Helen Hobbs for granting us permission to use the DHS data. This study is supported by the American Heart Association Scientist Development Grant (No. 10SDG4220051) to C.X.
