A Simple and Fast Two-Locus Quality Control Test to Detect False Positives Due to Batch Effects in Genome-Wide Association Studies

The impact of erroneous genotypes having passed standard quality control (QC) can be severe in genome-wide association studies, genotype imputation, and estimation of heritability and prediction of genetic risk based on single nucleotide polymorphisms (SNP). To detect such genotyping errors, a simple two-locus QC method, based on the difference in test statistic of association between single SNPs and pairs of SNPs, was developed and applied. The proposed approach could detect many problematic SNPs with statistical significance even when standard single SNP QC analyses fail to detect them in real data. Depending on the data set used, the number of erroneous SNPs that were not filtered out by standard single SNP QC but detected by the proposed approach varied from a few hundred to thousands. Using simulated data, it was shown that the proposed method was powerful and performed better than other tested existing methods. The power of the proposed approach to detect erroneous genotypes was ∼80% for a 3% error rate per SNP. This novel QC approach is easy to implement and computationally efficient, and can lead to a better quality of genotypes for subsequent genotype-phenotype investigations. Genet. Epidemiol. 34:854–862, 2010. © 2010 Wiley-Liss, Inc.


INTRODUCTION
Genome-wide association studies (GWAS) have been successful in contributing to the dissection of the genetic architecture of complex diseases [WTCCC, 2007]. A crucial step in the analysis of GWAS data is genotype quality control (QC) [Fardo et al., 2009;Mitchell et al., 2003;Pearson and Manolio, 2008]. Among standard QC measures are the exclusion of single nucleotide polymorphisms (SNPs) with low call rates (e.g. o95%) and minor allele frequencies (e.g. o1%), excluding loci due to departures from Hardy-Weinberg equilibrium (HWE) and excluding samples with high missing genotype rates. QC may become more important for more sophisticated analyses including multiple SNP analysis across the whole genome, estimation of heritability from a covariance structure derived from genome-wide SNP data, or prediction of genetic risk based on a set of SNPs that best explains genetic variation [Hoggart et al., 2008;Lee et al., 2008;Manolio et al., 2009;Visscher, 2008;Visscher et al., 2006;Wray et al., 2007].
Erroneous genotypes can be generated from random and/or nonrandom calling errors due to technical problems, DNA quality, DNA sample contamination, or systematic, constant unknown factors such as copy number polymorphisms. Frequently, genotype data from case and control samples are processed separately in different batches due to geographic or time differences when the samples were collected. Therefore, genotype miscalls and errors can be systematically associated with batch effects. Single SNP QC analyses or manual review of cluster plots may identify loci affected by batch effects. SNP QC analyses are routinely performed and implemented in software packages such as PLINK [Purcell et al., 2007]. Routine whole-genome QC analyses using multiple SNPs are less common.
In this report, a simple two-locus linear model analysis is used to detect SNP genotyping errors that are likely due to batch effects. The proposed approach can detect problematic SNPs with statistical significance even when standard single SNP QC analyses fail to detect them. With this novel approach, which is easy to implement and computationally fast, genotype quality can be improved.

RATIONALE
Standard single SNP QC analyses were performed on genome-wide SNP data from women recruited in Australia for a study of endometriosis [Treloar et al., 2002[Treloar et al., , 2005 and Australian controls genotyped in the same laboratory in a separate project [Ferreira et al., 2010;Medland et al., 2010]. A quantile-quantile (QQ) plot from preliminary analysis of single SNP association provided no evidence for systematic deviation from the null r 2010 Wiley-Liss, Inc.
distribution of no association. We then fitted two-locus models to see if they would provide stronger evidence for association than the single-locus models. Unexpectedly, we found highly significant signals for pairs of SNPs in linkage disequilibrium, well beyond what might be expected for true association signals. We investigated the reason for this substantial difference between the single and multiple SNP test and concluded (see below) that the unusual difference between the tests was because of genotyping errors and batch effects.
As an example, Figure 1 shows a genotype cluster plot for a SNP (called SNP1) that passed standard QC before inspection of cluster plots, but gave an unusually large test statistic for a two-locus test of association. The SNP has four genotype clusters in both cases and controls, but in the control group the second cluster was called as homozygous, whereas in the case group it was called as a heterozygote. In Table I, the genotype frequencies of SNP1 are shown. The total number of samples was 4,010. Using a linear model, the single SNP test for association gave a log-likelihood ratio (LR) of 20.05 (P-value 5 7.5eÀ06). When considering an adjacent SNP (SNP2), the genotype frequencies in the controls were slightly different from those of SNP1 and the single-locus association test gave a LR of 1.32 (P-value 5 0.25) for SNP2 (Table I). For an additive model, we expect that the association test for the combined genotypes of SNP1 and SNP2 (diplotypes) would give a similar value to the sum of the LR for each single SNP test, i.e. LR(SNP1,SNP2)ELR(SNP1)1 LR(SNP2). However, the association test for combined genotypes of SNP1 and SNP2 gave a LR of 124.06 (P-value 5 1.15eÀ27) ( Table I). One combination of the genotypes for SNP1 and SNP2, (T,T) (G,T), was 24 times more frequent in the controls than in the cases (Table I). Given the cluster plots in Figure 1, it is likely that the genotype (T,T) for SNP1 associated with the genotype (G,T) for SNP2 was generated due to genotype call errors. This hypothesis is supported by a haplotype analysis. The frequency of the haplotype TT was much higher in the controls than in the cases, resulting in high significance for a haplotype association test (P-value 5 8.36eÀ29) (Table I). However, a combination of the haplotype TT and TT was not observed at all in this sample, which strongly supported that the haplotype TT was artificially generated.
Since both SNPs passed standard QC, we aimed to develop a novel QC test to detect such genotyping errors and thereby reduce false positives due to batch effects.

METHODS
Linear models were used to detect potential false positives due to genotyping errors. A null model without any SNP effect is, where y is a vector of N phenotypic observations, m is the overall mean, 1 N is a vector of N ones and e is a vector of residuals which are assumed to be normally distributed with mean zero and variance s 2 e . In case-control studies, y has values of e.g. 0 and 1, whereas for quantitative traits the y values are continuous. When testing the effect of the ith SNP, the following additive model can be used, where X i is a vector of coefficients 0, 1, or 2 representing indicator variables of the genotypes of the ith SNP, and a i is the effect of the ith SNP. When considering joint effects from the ith SNP and jth SNP, the model can be with X j a vector of the genotype coefficients for the jth SNP, and a j the effect of the jth SNP. The natural logarithm of the likelihood given the model parameters for (2) is, assuming that the residuals are normally distributed, ln Pðy;X i jm;a i ;s 2 e Þ ¼ À The likelihood function for (1) and (3) can be accordingly derived (not shown). The logarithm of the likelihood ratio (LR) for the model fitting the ith SNP effect compared to the null model, is LRðSNP i Þ ¼ À2fln Pðy;X i jm;a i ;s 2 e Þ Àln Pðyjm;s 2 e Þg: ð5Þ The LR for the model fitting joint effects from the ith SNP and jth SNP, compared to the null model, is defined as, From (5) and (6), an adjusted likelihood ratio (LR Ã ) can be constructed for SNP i to test the difference in fit between the two-locus and single-locus models, Pðy;X i ;X j jm;a i ;a j ;s 2 e Þ Àln Pðy;X i jm;a i ;s 2 e Þg: Under the assumption of normality, the above LR test statistics are simple functions of the residual sums of squares from least squares linear regression. The value for LR Ã (SNP i ) is w 2 -distributed with one degree of freedom when the SNPs are uncorrelated and in the absence of genotyping artifacts. To implement the LR Ã (SNP i ) test genome-wide, a sliding window approach was taken in which pairs of adjacent SNPs were used in (7).

SIMULATION STUDY
Coalescence simulations were carried out using the program ''ms'' [Hudson, 2002] to generate SNPs under a Fisher-Wright equilibrium model. Parameters for the simulation of 2-Mb regions of the genome were chosen consistent with an effective population size N e 5 10,000, a mutation rate t 5 10eÀ08/site/generation, and a recombination rate r 5 10eÀ08/base pair/generation. Therefore, the parameters used for the program were y 5 4 N e ÃtÃ 2,000,000 5 800, and r 5 4 N e ÃrÃ2,000,000 5 800. For each replicate, 109 polymorphic sites were selected such that the average marker interval was $0.01 Mb, and the minor allele frequency (MAF) for each SNP was more than 0.05. The number of samples was 10,000, and 1,000 cases and 1,000 controls were randomly selected for the analysis.
For simulating higher levels of LD between SNPs, the effective size, mutation rate or/and recombination rate were reduced. We used y 5 r 5 560, 320, or 80 for alternative levels of LD.
Genotyping error model. Genotyping errors were introduced to every 10th SNP. Therefore, 10 SNPs of 109 SNPs had erroneous calls in each replicate. For the analysis of statistical power to detect genotyping errors, the 10 SNPs with errors and adjacent SNPs were used. The 10 SNPs with errors were in approximate linkage equilibrium.
Six kinds of error models were considered (Table II). For error models 1-4, 10% of either the cases or controls were randomly selected and their genotypes altered. In the first error model (error 1), if a selected individual was heterozygous, the genotype was switched to homozygous for the major allele. In the second error model (error 2), if a selected individual was homozygous for the major allele, the genotype was switched to heterozygous. In the third error model (error 3), if a selected individual was heterozygous, the genotype was switched to homozygous for the minor allele. In the fourth error model (error 4), if a selected individual was homozygous for the minor allele, the genotype was switched to heterozygous. In the fifth and sixth error model, genotyping errors occur in both case and control groups with different directions mimicking the observed example ( Fig. 1), and 5% of the individuals were randomly selected in each group and their genotypes manipulated. In the fifth error model (error 5), heterozygous genotypes for selected individuals were switched to homozygous for the major allele in one group, whereas homozygous genotypes for the major allele for selected individuals were switched to heterozygous in the other group. In the sixth error model (error 6), heterozygous genotypes for selected individuals were switched to homozygous for the minor allele in one group, whereas homozygous genotypes for the minor allele for selected individuals were switched to heterozygous in the other group.
Power calculation using simulated data. For comparisons, four existing methods were used to calculate the power using simulated data. First, the single SNP association test (ssa) was used. Second, the linear modelbased QC (lmq), the proposed method in this study, was Homozygous for minor allele miscalled as heterozygous in random 10% of either cases or controls Error 5 Heterozygous miscalled as homozygous for major allele in random 5% of cases, and homozygous for major allele miscalled as heterozygous in random 5% of controls Error 6 Heterozygous miscalled as homozygous for minor allele in random 5% of cases, and homozygous for minor allele miscalled as heterozygous in random 5% of controls used. Third, a haplotype association test (hap) was used where haplotypes for pairs of adjacent SNPs were estimated using an expectation-maximization algorithm, and an association test for the haplotypes was carried out [Purcell et al., 2006[Purcell et al., , 2007. Finally, a LD-contrast (ldc) method was used to test the significance of the difference for r 2 (LD measure) of a pair of adjacent SNPs between cases and controls [Zaykin et al., 2006]. The computer programs PLINK [Purcell et al., 2007] and LD-contrast [Zaykin et al. 2006] were used for the hap and ldc analyses, respectively. The type-I error was calculated from 10,000 tests with simulations of no genotyping errors. Empirical power was calculated from 1,000 tests with simulations for the genotyping error models, using the type-I error thresholds generated under the simulations without genotyping errors.

REAL DATA
The lmq test was further applied to real data. We analyzed genome-wide SNP data from a population-based case-control study for endometriosis [Painter et al., 2010, submitted;Treloar et al., 2002]. Our main interest was to detect possible false positives due to batch effects. For comparison, two independent control samples from Welcome Trust Case Control Consortium (WTCCC) were used for the analysis [WTCCC, 2007]. One of the two control groups was treated as a case group and the other was treated as a control group in the analysis.
For both data sets, the same standard QC analysis was performed. SNPs with MAFso0.01, missing rates 40.05, and P-values for HWE o0.0001, and individuals with missing rates 40.01 were filtered out. Sex chromosomes were excluded from the analysis. After the QC process, 4,083 individuals (2,247 cases and 1,836 controls) with 496,733 SNPs were used for the analysis of the endometriosis data, and 2,439 individuals (1,186 cases and 1,253 controls) with 385,054 SNPs were used for the analysis of the WTCCC data.
Four different threshold values were used to filter out SNPs with high missing rates, i.e. 0.05, 0.02, 0.01, and 0.001. For each threshold value, the inflation factor of the test statistics, the number of inflated data points, and QQ plots for ssa and lmq were reported as indicators of data quality.

SIMULATION STUDY
Type-I error and power using simulated data.
When there was no genotyping error, the Type-I error rate was maintained at acceptable levels for all methods using simulated data with both situations of low LD (simulation parameters y 5 r 5 800) and high LD (y 5 r 5 80) (Tables III  and IV). For the situation of high LD, the values for the type-I error rate were slightly low for lmq and ldc. This is to be expected because when adjacent SNPs are in high LD, the single and pairwise SNP tests for association become more similar, so that their difference does not follow a w 2 with one degree of freedom. In the extreme case of pairs of SNPs being in perfect LD, the single and pairwise tests are identical and the type-I error rate assuming a w 2 would be zero. Generally, the most powerful approach was lmq for which the power ranged from 0.08 to 0.74 for low LD (Table III) and from 0.16 to 0.79 for high LD (Table IV). The powers for ssa and hap methods were similar to each other when using error models 1 to 4. However, when using error models 5 and 6, where genotyping errors occurred in both cases and controls, ssa was more powerful than hap. The power for ldc was relatively low.
Genotyping error rate and power. For the simulated data, the average MAF of the selected SNPs was $0.2. The genotyping error rate per SNP was estimated for each error model. The error rate was 1.7, 3, 1.7, 0.2, 2.3, and 1% for the error models 1, 2, 3, 4, 5, and 6, respectively. Figure 2 shows that the power generally increases when the genotyping error rate increases as expected. For lmq, the power increased up to 75% when the genotyping error rate was about 3% (Fig. 2).  Levels of LD and power. When using simulation parameters y 5 r 5 800, 560, 320, and 80 with the error model 2 (homozygous for major allele switched to heterozygous), the estimated r 2 for adjacent SNPs was 0.20, 0.22, 0.25, and 0.33, respectively (Fig. 3). The power slightly increased when the level of LD increased, e.g. for lmq, the power increased from 0.74 to 0.79 (Fig. 3), but this increase was modest.

REAL DATA
For the endometriosis data, the overall quality of the genotypes was high. As expected, the proportion of excluded SNPs gradually increased when the threshold value for missing rates per SNP decreased (Table V). However, even with a stringent threshold value, a relatively small number of SNPs were excluded, i.e. 0.2, 0.6, and 8.6% of 496,733 SNPs for threshold values of 0.02, 0.01, and 0.001 (Table V). For the WTCCC data, the proportion of excluded SNPs was relatively large, i.e. 5, 13, and 48% of 385,054 SNPs for threshold values of 0.02, 0.01, and 0.001 (Table V). Figure 4 shows QQ plots for the ssa tests (left column) and for the pairwise lmq test (right column) in a vertical order of the applied threshold 0.05, 0.02, 0.01, and 0.001 when using the endometriosis data. With the threshold values of 0.05, 0.02, or 0.01, the patterns of the plots for ssa differed from those for lmq. For ssa, the plots showed a very good agreement between the expected and observed values and the inflation factor (l) was close to one, and few data points deviated from expectation (data points above the horizontal line) (Table V). However, the plots for lmq showed that observed values deviated from expectation for a relatively large number of points ( Fig. 4 and Table V). Deviation from expectation is likely to occur because of genotyping errors due to batch effects. More stringent threshold values for missing rates per SNP reduced the number of data points that deviated from expectation. There were few data points that deviated from expectation with the threshold of 0.001 ( Fig. 4 and Table V).
When using the WTCCC data, the number of data points that deviated from expectation was large unless SNPs having missing rates of more than 0.01 were filtered out (Fig. 5). The difference between the results from ssa and lmq was remarkable. With the threshold value of 0.05, the number of data points deviating from expectation was 5,019 (1.3% of the corresponding number of SNPs) for ssa and 29,275 (7.6%) for lmq (Table V). With lower threshold values, the number of SNPs that deviated from expectation was reduced. With the threshold value of 0.01, there was a good agreement between the expected and observed values for ssa. However, there was still significant deviation for lmq. For example, the number of data points that deviated from the expectation was 2,375 (0.7%). This indicated that there were still genotyping errors due to batch effects although the proportion was low. Few data points deviated from expectation when using a stringent threshold value of 0.001 ( Fig. 5 and Table V). The inflation factor (l) was close to 1.0 for lower threshold values for both ssa and lmq, although the pattern was more obvious for lmq.

DISCUSSION
We have developed and applied a novel lmq method, and showed that it could detect many of the problematic SNPs that already passed standard single SNP QC filters. Using simulated data, we showed that the lmq test was reasonably powerful and performed well in comparison with other methods.
In GWAS, the impact of erroneous genotypes can be severe on false-positive SNP-phenotype associations,  estimation of heritability, and prediction of genetic risk. Erroneous SNPs with high significance may be accepted as genuine if they have already passed a standard QC process although manual inspection of cluster plots will detect some problematic SNPs. Even if the significance of SNPs with erroneous genotype calls is low enough to be ignored, accumulated information from such SNPs can cause high significance for haplotype-based association tests, overestimation of relationships between individuals within each case and control group, and inaccurate and Fig. 4. QQ plots of the negative logarithm of P-value when using endometriosis data. The left column is for ssa and the right column is for lmq in a vertical order of the applied threshold 0.05, 0.02, 0.01, and 0.001. The horizontal line is determined such that the difference between the negative logarithms of the expected and observed P-value above the line is greater than 0.2. QQ, quantile-quantile.
imprecise prediction of genetic risk [Becker et al., 2006;Liu et al., 2006;Marquard et al., 2009]. Furthermore, metaanalysis from large multi-centre studies relies heavily on genotype imputation to create a set of genotypes (or dosage scores) that can be compared across studies.
Genotyping errors such as detected by our method could, if undetected, lead to an increased rate of false positives and false negatives. We used a simple linear model assuming that the residuals were normally distributed. In a more formal way, Fig. 5. QQ plots of the negative logarithm of P-value from when using WTCCC data. The left column is for ssa and the right column is for lmq in a vertical order of the applied threshold 0.05, 0.02, 0.01, and 0.001. The horizontal line is determined such that the difference between the negative logarithms of the expected and observed P-value above the line is greater than 0.2. QQ, quantile-quantile.
a non-linear model such as a logistic regression could be used to test the difference in fit between the two-locus and single-locus models. However, the obtained LR values between a linear and logistic model were very similar (result not shown). We prefer the simple linear model because of its computational efficiency.
Although we did not simulate a disease-association model, we applied the method to the endometriosis data. For those data, we observed that the SNPs detected by the two-locus QC were obviously problematic in their cluster plots and GenTrain scores. Conversely, we observed that the lmq test statistics for SNPs having high significance for diseaseassociation with clear cluster plots and high GenTrain scores were not inflated compared to the ssa test statistic (the best 10 SNPs were checked). This indicated that at least for these data the two-locus QC test performed properly well, in that it detected false positives (as evidenced by a closer inspection of the SNP quality scores) yet did not flag SNPs that appear to be associated with disease.
The two-locus QC detects hidden (rare) haplotypes across two loci that are associated with disease status. It is true that such haplotypes could be generated due to genuine causal variants (e.g., recent mutations). To minimize false positives or discarding truly associated SNPs, we used a composite strategy to control data quality, i.e. removing problematic SNPs based on SNP missing rates and assessing data quality by the lmq statistics (Table V and Figs. 4 and 5).
In order to remove problematic SNPs, it should be safe and reasonable to rely on the information of SNP missing rates as low-quality genotypes often result in high missing rates (low call rates) [Fu et al., 2009]. We used four thresholds for missing rates per SNP and indicated that the statistical signals for the presence of erroneous genotypes clearly reduced with more stringent thresholds. A high restriction allowing only minimum missing rates may exclude SNPs that do not suffer from genotyping errors. However, the loss of information is not large. For example, for the endometriosis data, the restriction of missing rates per SNP less than 0.01 excluded only $1% of 496,733 SNPs that passed a preliminary standard QC.
We checked the performance of the proposed approach compared to GenTrain and cluster separation scores that can be obtained from the Illumina Beadstation software. We investigated the most problematic 11 pairs of SNPs according to their LR from the lmq test using the endometriosis data. Usually, only one SNP of each pair appeared to cause the inflated test statistic. Details of these 11 SNPs are listed in Table VII. In general, the GenTrain scores were reasonably high for all the SNPs (above 0.7). This was expected because the listed SNPs already passed standard QC. The cluster separation scores were low for most of the problematic SNPs, but high for rs4621162 and rs10810402. Interestingly, the missing rates were relatively high (40.01) for all the flagged SNPs.
Are there other simple tests readily available that would detect the same problematic SNPs? The versatile program PLINK has a number of functions that would be implemented for QC. For example, the function loopassoc can be used to test for an association between gene frequency and plate. The PLINK function test-missing tests the difference of missing rate per SNP between cases and controls. We applied this function to the endometriosis data (Table VI). We observed that a large proportion of problematic SNPs detected by our two-locus test showed no significance for the difference of missing rate. It is noted that the PLINK-test-missing function may not be able to pick up differential haplotypic missingness such as error models 5 and 6. On the other hand, a large proportion of SNPs not identified by our two-locus test gave significance for the difference of the missing rate between cases and controls. There were not many common SNPs that were picked up by both the proposed method and PLINK test-missing function (Table VI). Therefore, at least for the endometriosis data, the test for differential missingness  The total number of SNP is 496,733, and the number of SNP with a missing rate 40 for both cases and controls is 258,628.
across cases and controls identifies a different subset of potential problematic SNPs than our proposed method. We used a sliding window approach fitting pairs of SNPs starting from one end of the chromosome. We checked how stable the results were if it started from the other end using the endometriosis data. When carrying out the lmq analysis with a sliding window from pter (short arm telomeric end) to qter (long arm telomeric end) or from qter to pter, the test statistic inflation factor l and the number of data points deviating from expectation for each threshold value were very similar (Table VIII). This indicates that the method appears robust to sliding window orientation.
In conclusion, we propose a simple and fast two-locus test to detect SNPs among those that passed single-SNP QC that are likely to suffer from batch effects. The proposed test harvests haplotypic information to provide a sensitive approach to identify differential technical artefacts across genotyping projects/batches that is not detected using current standard QC of GWA data. In combination with standard single-SNP QC analyses, our test may contribute to a better quality of the genotypes used for subsequent genotype-phenotype investigations.