Allele balance bias identifies systematic genotyping errors and false disease associations

Abstract In recent years, next‐generation sequencing (NGS) has become a cornerstone of clinical genetics and diagnostics. Many clinical applications require high precision, especially if rare events such as somatic mutations in cancer or genetic variants causing rare diseases need to be identified. Although random sequencing errors can be modeled statistically and deep sequencing minimizes their impact, systematic errors remain a problem even at high depth of coverage. Understanding their source is crucial to increase precision of clinical NGS applications. In this work, we studied the relation between recurrent biases in allele balance (AB), systematic errors, and false positive variant calls across a large cohort of human samples analyzed by whole exome sequencing (WES). We have modeled the AB distribution for biallelic genotypes in 987 WES samples in order to identify positions recurrently deviating significantly from the expectation, a phenomenon we termed allele balance bias (ABB). Furthermore, we have developed a genotype callability score based on ABB for all positions of the human exome, which detects false positive variant calls that passed state‐of‐the‐art filters. Finally, we demonstrate the use of ABB for detection of false associations proposed by rare variant association studies. Availability: https://github.com/Francesc-Muyas/ABB.


INTRODUCTION
The rapid improvement of next-generation sequencing (NGS) throughput and cost has changed biomedical research as well as clinical diagnostics of genetic diseases and cancer (Altmann et al., 2012). Numerous genome-sequencing projects have catalogued millions of frequent and rare variants, some of which are associated with disease (Auton et al., 2015). If the studied event is rare, such as de novo germline mutations, the likelihood to observe false positive (FP) calls is further increased (Gómez-Romero et al., 2018; Veltman & Brunner, 2012). Moreover, rare variant association studies (RVAS) can generate false results if genes are enriched with sequencing or alignment errors, leading to false associations with the studied disease (Hou et al., 2017; Johnston, Hu, & Cutler, 2015; Yan et al., 2016). Hence, some RVAS methods take error probabilities into account (He et al., 2015) or bypass genotype calls completely by directly modeling sequencing reads (Hu, Liao, Johnston, Allen, & Satten, 2016). The impact of false genotype calls is amplified in the study of recurrent somatic mutations in cancer, particularly if ultra-deep sequencing is used to identify subclonal mutations with low minor allele frequency (Cai, Yuan, Zhang, He, & Chou, 2016). Moreover, recent benchmarking studies reported substantial disagreement between somatic SNV and indel prediction methods (Alioto et al., 2015).
Although most variant calling algorithms can deal with random sequencing errors, systematic errors have mostly been neglected in the past and thus more often lead to FP variant calls. Several error sources of sequencing-by-synthesis platforms are well described, such as crosstalk and dephasing (Ledergerber & Dessimoz, 2011; Pfeiffer et al., 2018; Sleep, Schreiber, & Baumann, 2013), missed nucleotides in low-complexity regions (H. Li, 2014), index hopping ("bleeding") (Vodák et al., 2018), DNA damage during library preparation (Chen, Liu, Evans, & Ettwiller, 2017) caused by, for example, 8-oxo-G formation during acoustic shearing or oxidative stress during probe hybridization (Newman et al., 2016; Park et al., 2017), and decreased coverage in regions of very high or low GC content (Sleep et al., 2013). Li (2014) showed that a large fraction of systematic errors found in variant callsets were not due to sequencing errors, but to erroneous realignments in low-complexity and repetitive regions (about 2% and 45% of the human genome, respectively), as well as to the incompleteness of the reference genome with respect to the analyzed sample. While repetitive regions lead to ambiguous alignments of short reads and thus increase the likelihood of assigning a read to the wrong locus, low-complexity regions also cause misalignment of reads at the correct position (Cordaux & Batzer, 2009; H. Li, 2014). Furthermore, many aligners tend to misplace indels close to the end of a read, as their scoring function favors mismatches over gap openings.

Strategies for identification of systematic genotyping errors
A multitude of variant calling algorithms that apply various strategies to reduce the false positive rate (FPR) have been developed, for example, FreeBayes (Garrison & Marth, 2012) or VarScan (Koboldt & Larson, 2013; Koboldt et al., 2009). Other tools, for example, Strelka (Saunders et al., 2012), VarScan2, and MuTect (Cibulskis et al., 2013), specialize in somatic mutation calling using tumor-normal pairs. Most of these methods apply Bayesian statistics (e.g., Bayesian classifiers) to compute genotype likelihoods (Garrison & Marth, 2012; Van der Auwera et al., 2013) or, in the case of somatic mutations, the likelihood of the variant model (Cibulskis et al., 2013). Some systematic alignment issues can be addressed by haplotype-based variant calling as performed by FreeBayes (Garrison & Marth, 2012). Issues with gapped alignments around indel alleles are tackled by alignment postprocessing, using either multiple-read realignment or local assembly (DePristo et al., 2011; Van der Auwera et al., 2013). Still, stringent post-filtration of callsets is a necessity, using machine learning-based error models (e.g., Variant Quality Score Recalibration [VQSR] (Carson et al., 2014; Van der Auwera et al., 2013)) or thresholds on various call quality statistics, such as genotype quality, read depth, variant allele frequency (VAF) (Li, 2014), clustered variants, or Fisher strand bias (Guo et al., 2012). Other strategies include removal of variants in low-complexity regions as well as in repeats incompletely represented in the reference genome (typically indicated by significantly increased read coverage) (Carson et al., 2014; Li, 2014). However, a general issue of many post-filtration strategies is their use of hard thresholds for the various quality metrics, where small changes can dramatically influence false negative and false positive rates, or their dependence on large sample sets to be effective (e.g., VQSR) (De Summa et al., 2017; Lek et al., 2016).
Here, we present a new strategy to identify systematic sequencing or alignment errors leading to false variant calls, which is based on the recurrent and significant deviation of observed to expected allele balance (AB) in a genomic position across large control cohorts. This signature, termed allele balance bias (ABB), was found in 0.03% of all exonic positions, 4% of high confidence germline SNV calls in 987 exomes, and 8% of somatic SNV calls in 200 tumor-normal pairs. We present two algorithms: one for computing ABB for all positions of the exome (or genome) using large cohorts of WES (or WGS) samples, and one for refining candidate gene lists generated by RVAS. We have trained an ABB model for the human exome optimized for the use in clinical exome diagnostics and rare variant association testing in coding genes. Finally, we provide ABB genotype callability scores for all positions of the human exome.

Sequencing was performed on Illumina HiSeq instruments (e.g., HiSeq2500) using 2 × 100 bp paired-end reads (Bentley et al., 2008). Reads were aligned against the human reference genome (hg19) using BWA-MEM (Li, 2013; Li & Durbin, 2009). Alignment post-processing was performed according to GATK best practice guidelines (Van der Auwera et al., 2013), including PCR duplicate marking, indel realignment, and base quality recalibration (Bao et al., 2014). Variant calling was performed using GATK HaplotypeCaller v3.3 (McKenna, 2009; Van der Auwera et al., 2013). Variants with genotype quality below 20 or Fisher strand bias (FS) in the top 10th percentile were removed. For benchmarking purposes, we generated two callsets, one with and one without applying the GATK VQSR filter (tranche threshold of 99.9%). See the Accession Numbers section for available data.

Deviation of observed from expected AB
We investigated the relationship between recurrent deviation of observed from expected AB, systematic errors, and FP SNV calls in whole-exome sequencing (WES) data. AB describes the fraction of reads supporting the alternative allele at a focal position (AB = alternative read count/total read count at the focal position). When sequencing diploid species, heterozygous genotypes are expected to show an AB close to 0.5. We modeled the read distribution for heterozygous genotypes at depth of coverage D using a binomial distribution, Binomial(D, 0.5) (Guo et al., 2013; Nothnagel et al., 2011; O'Fallon, Wooderchak-Donahue, & Crockett, 2013). Homozygous genotypes are expected to have close to 100% of reads supporting the same allele, with the amount of deviating reads depending on the sequencing and alignment error rate and other variables. We modeled the expected read distribution for homozygous reference genotypes using a zero-inflated beta distribution and for homozygous alternative genotypes using a one-inflated beta distribution, such that AB lies inside the range [0, 1] (Ospina & Ferrari, 2012). The corresponding probability density function is given by beinf(y; α, p, μ, φ) = (1 − α)·f(y; μ, φ), if y ∈ (0, 1), where f(y; μ, φ) is the beta density function, and μ and φ are the parameters that define the shape of the beta distribution. Note that, if y ∼ BEINF(α, p, μ, φ), then P(y = 0) = α(1 − p) and P(y = 1) = αp.
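The genotype models above can be sketched as follows; this is an illustrative Python implementation, not the published code, and the inflation and beta shape parameters used for the homozygous-reference model are arbitrary example values rather than fitted ones.

```python
from scipy.stats import binom, beta

def het_likelihood(alt_reads, depth):
    # Heterozygous model: alternative read count ~ Binomial(depth, 0.5),
    # i.e., AB is expected to be close to 0.5.
    return binom.pmf(alt_reads, depth, 0.5)

def hom_ref_likelihood(ab, alpha=0.6, a=1.0, b=20.0):
    # Zero-inflated beta model for homozygous reference genotypes:
    # point mass `alpha` at AB == 0, and the remaining (1 - alpha) mass
    # follows a Beta(a, b) density skewed toward 0. The values of
    # alpha, a, b here are placeholders, not fitted parameters.
    if ab == 0.0:
        return alpha
    return (1.0 - alpha) * beta.pdf(ab, a, b)
```

A heterozygous site sequenced at depth 30 is thus most likely to show about 15 alternative reads, while under the homozygous-reference model an AB of 0.05 is far more plausible than an AB of 0.5.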

ABB-based genotype callability model
To integrate the three measures of ABB into one genotype callability score, we trained a logistic regression model using the three RdAB measures as features to predict the labels obtained by Gaussian mixture modeling. It returns the probability of a variant site belonging to label 1 (recurrently deviated AB positions) and was fitted using the R function glm with family = "binomial": P(y = 1 | x) = 1/(1 + e^−(β₀ + Σᵢ βᵢxᵢ)), with y = 1 indicating recurrently deviated positions.
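A minimal sketch of such a logistic callability score is given below. The published model was fitted with R's glm; this Python analogue only illustrates the scoring step, and the coefficients are made-up placeholders, not the published model parameters.

```python
import numpy as np

def abb_score(rdab, coef, intercept):
    """Logistic callability score: probability that a site belongs to
    label 1 (recurrently deviated AB). `rdab` holds the three RdAB
    features; `coef` and `intercept` would come from a fitted model
    (the values used below are purely illustrative)."""
    z = intercept + np.dot(coef, rdab)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical fitted parameters (not the published model):
coef = np.array([2.1, 3.4, 1.7])
intercept = -4.0

# A site with strongly recurrent AB deviation gets a score near 1:
score = abb_score(np.array([0.9, 0.8, 0.95]), coef, intercept)
```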
Subsequent to the estimation of the logistic regression parameters βᵢ, we calculated F1 scores at different probability levels and chose the maximum as the optimal cutoff to assign labels. For this cutoff, we calculated precision, recall, F1 score, and FPR (see formulas in the Supporting Information). Based on the resulting ABB score, positions were binned into genotype callability levels ranging from high confidence to very low confidence (ABB > 0.9). We interrogated the enrichment of very low confidence sites in somatic variant calls using tumor-normal paired data from 200 CLL patients whose normal samples had not been used for ABB model building (germline variant positions used in model building were also excluded from the enrichment analysis). Somatic SNVs were predicted using MuTect (Cibulskis et al., 2013). We measured the enrichment of high, medium, low, and very low confidence sites in somatic mutation calls compared to their exome-wide expectation, and the enrichment of each quality bin in COSMIC and dbSNP, using in both cases Pearson's chi-square test.
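The cutoff selection step can be illustrated by a simple grid scan over probability thresholds (an illustrative sketch, not the authors' code):

```python
import numpy as np

def best_f1_cutoff(probs, labels, grid=None):
    # Scan probability cutoffs and return the one maximizing the F1
    # score (harmonic mean of precision and recall), as described above.
    grid = np.linspace(0.01, 0.99, 99) if grid is None else grid
    best = (0.0, 0.5)  # (best F1 so far, corresponding cutoff)
    for c in grid:
        pred = probs >= c
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        if tp == 0:
            continue
        prec = tp / (tp + fp)
        rec = tp / (tp + fn)
        f1 = 2 * prec * rec / (prec + rec)
        if f1 > best[0]:
            best = (f1, c)
    return best
```

For perfectly separable scores the scan recovers a cutoff between the two groups with F1 = 1.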

Evaluation of ABB by Sanger sequencing

RESULTS
We have developed a genotype callability score for NGS analysis based on the recurrent deviation of observed from expected AB, termed ABB. Using an ABB model trained on 987 WES datasets, we precomputed ABB genotype callability scores for more than 81 million positions of the human exome. We did not observe biases in genotype calling (Supporting Information Figure S1). To evaluate the performance of ABB on identification of systematic errors and FP genotype calls, we used an independent set of 210 WES cases and Sanger validation.
In addition, we demonstrate that ABB correlates with various measures of sequencing and alignment errors and show that public variant databases are enriched for systematic genotyping errors.

Training and evaluation of the ABB model
We hypothesized that systematic sequencing or alignment errors lead to recurrent and significant deviation of observed from expected AB across samples (Supporting Information Table S4).

ABB genotype callability filter for germline and somatic variant calling
To evaluate if the use of ABB as a genotype callability filter leads to improved variant callsets, we applied an ABB very low confidence filter (ABB > 0.9) to variants predicted by GATK HaplotypeCaller with VQSR filtering. Using a callset for 10 samples not used during ABB model training, evaluation, or testing, we found that 13,168 out of 346,894 (3.80%) variant sites overlapped with ABB very low confidence sites (compared with 0.025% of all exonic positions, P value < 10⁻¹⁶; Table 1 and Supporting Information Figure S2), with an average of 1,317 very low confidence variant sites per sample.

TABLE 1: Distribution of ABB genotype callability levels in the whole exome, germline SNV calls, and somatic SNV calls.

Somatic SNV calls showed an even stronger overlap with low and very low confidence positions compared to the 3.8% observed for germline variant calling (P value < 10⁻¹⁶) and the exome-wide expectation (P value < 10⁻¹⁶).

ABB callability
Interestingly, 45.38% of the very low confidence mutations were found in dbSNP. This proportion was significantly higher (P value < 10⁻¹⁶) than the fraction of high confidence somatic mutations in dbSNP (9.51%; Table 2), pointing to a systematic introduction of errors into dbSNP.
To demonstrate that our model does not falsely label real somatic mutations as systematic errors, we intersected positions marked as systematic errors (ABB ≥ 0.9) with 1,896 somatic mutations validated in two studies (Papaemmanuil et al., 2014; Tarpey et al., 2013) and 341 well-known cancer driver mutation hotspots (Chang et al., 2016).
We found a minimal overlap of two out of 1,896 and zero out of 341, respectively. There was no significant difference between the fraction of systematic errors identified in the whole exome (Table 1) and in the set of validated somatic SNV positions (P value = 0.1421), demonstrating that ABB does not misclassify true somatic mutations as systematic errors more often than expected by chance (Supporting Information Table S6).

Sanger sequencing-based evaluation of ABB scores
We next evaluated if ABB scores correlate with the probability of calling FP variants. To this end, we validated by Sanger sequencing a set of 209 randomly selected heterozygous SNPs predicted by GATK HaplotypeCaller with VQSR in 10 samples, which had AB between 0.2 and 0.35 and were sampled equally from each of the four ABB genotype callability levels (see section Materials and Methods). We found that ABB genotype callability levels correlated with FPR (Figure 2c, Table 3, and Supporting Information Table S7). Furthermore, ABB scores were predictive of FP calls (ROC-AUC = 0.778; Figure 2b). Although the original variant callset produced by GATK HaplotypeCaller and VQSR could be considered high quality (tranche threshold of 99.9%), we found an FPR of 50% in the very low confidence set and 31% in the low confidence set, while high and medium confidence positions showed only 0% and 15.4% FPR, respectively. Interestingly, the fraction of failed (ambiguous) Sanger sequencing experiments was significantly higher for the low confidence range when compared against the high confidence range (P value < 0.025; Table 3), indicating that low-complexity regions and repeats constitute one of the underlying issues, as these also affect the efficiency of Sanger sequencing (Kieleczawa, 2006) (Supporting Information Figure S3 and Table S11). We observed a large margin between ABB for TPs (average of 0.115) and FPs (average of 0.666).

ABB scores of variants in public databases
Public variant repositories differ in the way included variants are called, quality controlled, and selected for integration. For instance, the 1000GP, ExAC/GnomAD, and EVS databases are created in a consistent manner, using a defined pipeline for all samples (Lek et al., 2016). However, dbSNP does not dictate any specific variant prediction method or quality control procedure and contains both germline and somatic variants. Hence, we hypothesized that although all variant databases may contain systematic errors, dbSNP is specifically affected by FPs due to its inconsistent quality parameters, as previously suggested (Musumeci et al., 2011). We found that very low confidence positions were significantly enriched in several public variant databases (all P values < 10 −16 ; see Table 4). As expected, we found the strongest enrichment of systematic errors (ABB > 0.9) in dbSNP (15.9 times more than expected). As many variant analysis pipelines use the same tools as employed for generating 1000GP, ExAC, GnomAD, or EVS, one should be cautious when considering variants found in these databases as validation gold standard for variants in newly generated callsets. As we expect to see systematic errors repeatedly, this circularity issue (validation using false variants predicted by the same tools) can lead to a "self-fulfilling prophecy," where false variants are established as true positives in public databases and potentially influence disease studies in the future.
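An enrichment test of this kind can be sketched with a chi-square test on a 2×2 contingency table. The counts below are invented for illustration and do not correspond to the paper's data.

```python
from scipy.stats import chi2_contingency

# Hypothetical counts: rows = sites in database / not in database,
# columns = ABB very low confidence / all other positions.
table = [[1500,  98500],
         [ 250, 899750]]

chi2, p, dof, expected = chi2_contingency(table)

# Fold enrichment of very low confidence sites among database entries
# relative to the overall background rate.
fold = (1500 / 100000) / ((1500 + 250) / (100000 + 900000))
```

With these illustrative counts, very low confidence sites are several-fold over-represented in the database, and the chi-square P value is vanishingly small.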

Filtering candidate genes from RVAS
WES is frequently applied to identify causal variants for genetic diseases, using rare variant association tests in large cohorts or analysis of affected families and parent-child trios. Although ABB can be used generically to filter results of variant callsets, we have in addition developed a custom algorithm, Association-ABB, for identification of cohort-specific false associations caused by systematic errors (see section Materials and Methods). We hypothesized that false associations can be introduced in case-control studies due to (1) a bias in systematic errors between cases and controls, leading to an uneven burden of false variant calls, or (2) copy number variants enriched in cases or controls, for example, due to biased population structure. Therefore, we reanalyzed variants in candidate genes in order to identify associations better explained by biases in the burden of systematic errors.
The Association-ABB evaluation was performed on candidate genes resulting from an RVAS for CLL using WES of 437 CLL normal samples from ICGC-CLL (Quesada et al., 2011) and 780 control samples.
We identified candidate-gene variant sites with biased AB between cases and controls (Supporting Information Table S12). In addition, these variants showed a high ABB score (average of 0.8670). We next performed a gene-wise aggregated test of biased sites and found that 10 out of 43 candidate genes were likely false associations (Supporting Information Table S13). In brief, we tested if the RVAS association was still significant when (1) biased sites were excluded or (2) potentially missed calls were added to the test (see section Materials and Methods).
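The principle behind testing whether an association survives the exclusion of biased sites can be illustrated with a simple carrier-burden comparison. Fisher's exact test here is a generic stand-in for the RVAS statistic, and all counts are hypothetical.

```python
from scipy.stats import fisher_exact

def burden_test(case_carriers, n_cases, ctrl_carriers, n_ctrls):
    # 2x2 Fisher exact test of rare-variant carrier counts in cases
    # versus controls: a simple stand-in for the RVAS association test.
    table = [[case_carriers, n_cases - case_carriers],
             [ctrl_carriers, n_ctrls - ctrl_carriers]]
    return fisher_exact(table)[1]  # two-sided P value

# Hypothetical gene: counting calls at biased sites inflates the case
# burden; excluding those sites removes the apparent association.
p_all = burden_test(25, 437, 8, 780)    # biased sites included
p_clean = burden_test(6, 437, 8, 780)   # biased sites excluded
```

In this toy example, the test is significant only as long as the biased sites are kept, mirroring the "cleaning" procedure described in the text.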
One example gene, CTDSP2, is shown in Supporting Information Figure S7. We observed that a different AB distribution between cases and controls at 8 biased positions led to an imbalanced genotyping efficiency of GATK HaplotypeCaller, explaining the significant association of this gene with CLL in the RVAS test. Comparing called versus potentially missed SNVs, we found a significant enrichment of missed calls in the controls, that is, positions called as homozygous reference although more reads than expected showed the alternative allele.
However, not only cases showed an enrichment of calls with significantly deviated AB in heterozygotes (AB around 0.25), but also those controls that had been captured using Agilent SureSelect, while controls prepared with NimbleGen SeqCap EZ were "clean" (Supporting Information Figure S8 and Supporting Material). We conclude that a systematic issue with a few target regions of one enrichment kit introduced the false RVAS call.
The AB patterns between cases and controls in the gene CDC27 looked similar to those in CTDSP2 (Supporting Information Figure S9), although this enrichment was not associated with the capture method as it was for CTDSP2 (see Supporting Information Figure S10). Moreover, a literature search revealed that this gene frequently harbors FP SNVs (Jia et al., 2012), likely caused by multiple novel retroduplications (Abyzov et al., 2013). Indeed, we found that cases with deviating AB also showed significantly increased coverage on the exons potentially affected by retroduplications (Supporting Information Figure S11) (see Supporting Material for a detailed explanation of this section).
Association testing after removing the problematic sites in CDC27 or CTDSP2 ("cleaning"), or after adding potentially missed calls, led to non-significant association tests. In summary, ABB identified retrotransposition-related as well as exome hybridization kit-related systematic errors causing false associations in an RVAS study of CLL.

DISCUSSION
In this work, we present a new genotype callability filter for exome or genome sequencing analysis, which is based on the recurrent and significant deviation of observed from expected AB. Somatic variant calling is particularly susceptible to such systematic errors (Griffith et al., 2015). Indeed, we observed that close to 14% of somatic SNVs called by MuTect were classified as ABB low or very low confidence sites, a significantly larger fraction than observed for germline variant calling or expected on an exome-wide level. Moreover, these FP mutations are again highly enriched in dbSNP. Considering the importance of predicted point mutations for cancer diagnostics and optimal treatment selection, removal of these FP calls is essential for the applicability of NGS in precision oncology.
Systematic FP calls can lead to false associations of genes with disease. Using Sanger validation of disease gene candidates prioritized in previous projects, we demonstrated that high confidence ABB sites are 100% true positives, allowing the cost of Sanger validation to be reduced by omitting these sites. At the same time, ABB identifies up to 65% of false candidates (considering low or very low confidence sites; up to 100% if medium confidence sites are also considered FP).
We further demonstrate how systematic errors resulting in false associations can be identified by Association-ABB in a cohort-specific manner. We found that, in a rare variant association test for CLL, around 25% of candidate gene associations were better explained by an uneven burden of systematic errors between cases and controls. We further hypothesized that systematic SNV calling errors were introduced by an unannotated CNV in at least a couple of candidate genes, indirectly pointing to the real cause of the genotype-phenotype association.
The current ABB model has been built using alignments generated by BWA-MEM. Hence, some systematic errors identified in our study might reflect specific alignment issues of BWA-MEM and might not be observed when using, for example, Bowtie2. However, BWA-MEM is one of the most widely used aligners for human genomics, making our model directly applicable to a majority of projects using WES or whole-genome sequencing of human samples. Nonetheless, one could retrain the ABB model for specific computational analysis pipelines, for other species, for whole genomes, or using thousands of additional samples, a process we support by offering all scripts for generating dedicated ABB models (see Availability section).
In summary, our novel genotype callability estimator based on ABB identifies systematic genotyping errors and FP variant calls that pass state-of-the-art filters, and supports the detection of false associations proposed by rare variant association studies.