Reanalysis and optimisation of bioinformatic pipelines is critical for mutation detection

Abstract Rapid advances in genomic technologies have facilitated the identification of pathogenic variants causing human disease. We report siblings with developmental and epileptic encephalopathy due to a novel, shared heterozygous pathogenic 13 bp duplication in SYNGAP1 (c.435_447dup, p.(L150Vfs*6)) that was identified by whole genome sequencing (WGS). The pathogenic variant had escaped earlier detection via two methodologies: whole exome sequencing (WES) and high‐depth targeted sequencing. Both technologies had produced reads carrying the variant; however, these reads were either not aligned due to the size of the insertion or aligned to multiple major histocompatibility complex (MHC) regions in the hg19 reference genome, making the critical reads unavailable for variant calling. The WGS pipeline followed different protocols, including alignment of reads to the GRCh37 reference genome, which lacks the additional MHC contigs. Our findings highlight the benefit of using orthogonal clinical bioinformatic pipelines and all relevant inheritance patterns to re‐analyze genomic data in undiagnosed patients.

High-throughput targeted sequencing of 65 epilepsy genes using molecular inversion probes (MIPs; Carvill et al., 2013) was performed on DNA from the family quartet (both girls and their parents) in 2014.
The 100 bp paired-end reads were aligned to a custom hg19 reference genome containing only chromosomes 1-22, X, Y, and chrM, using bwa sampe (v0.5.9-r16; Li & Durbin, 2009) with standard settings, and variant analysis and filtration were performed as described (Carvill et al., 2013). No [...]
Figure 1 caption (fragment): One of the MHC contigs included in the hg19 reference genome, chr6_ssto_hap7, includes SYNGAP1. The mapping quality scores of reads that align to both chr6 and chr6_ssto_hap7 were penalized to zero (translucent reads), whereas reads that map only to chr6 have high mapping quality (grey reads). [...] Information Table S1)
In the MIPs targeted sequencing data, good coverage of SYNGAP1 exon 5 was observed, with a read depth of >142× in the older sister (Figure 1b). Given the older read aligner (bwa sampe v0.5.9-r16) and the high ratio of the 13 bp duplication to the 100 bp read length, we hypothesized that the reads supporting the duplication may have been misaligned. The reads were re-aligned to the reference genome with an increased maximum number of gap extensions (-e 20), which resulted in the variant-carrying reads from two overlapping amplicons being appropriately aligned (Figure 1c).
Updating the aligner to bwa mem v0.7.15 also correctly aligned the reads carrying the mutation using default settings (not shown).
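One way to check whether a re-alignment has recovered reads carrying a long insertion is to scan the CIGAR strings in the resulting SAM records. The following is a minimal sketch, not the pipeline used in this study: the function names and the toy records are hypothetical, and it assumes plain-text SAM input in which the duplication is represented as an insertion (`I`) operation.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM specification).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def longest_insertion(cigar):
    """Return the length of the longest insertion (I) in a CIGAR string, or 0."""
    if cigar == "*":  # CIGAR unavailable, e.g. unmapped read
        return 0
    return max((int(n) for n, op in CIGAR_OP.findall(cigar) if op == "I"), default=0)


def reads_with_insertion(sam_lines, min_len=13):
    """Return names of mapped reads whose alignment has an insertion >= min_len bp."""
    names = []
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x4:  # 0x4 = read unmapped
            continue
        if longest_insertion(fields[5]) >= min_len:
            names.append(fields[0])
    return names


# Toy SAM records (hypothetical coordinates, SEQ/QUAL omitted as "*").
toy_sam = [
    "@SQ\tSN:chr6\tLN:1000",
    "read_a\t0\tchr6\t100\t60\t40M13I47M\t*\t0\t0\t*\t*",
    "read_b\t0\tchr6\t100\t60\t100M\t*\t0\t0\t*\t*",
    "read_c\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]
print(reads_with_insertion(toy_sam))
```

Running the same scan on alignments produced before and after changing the aligner settings would show whether the variant-carrying reads went from unaligned to aligned with a 13 bp insertion.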
In the WES dataset, we observed no reads across SYNGAP1 with mapping quality >20 (Figure 1d). Removing the mapping quality filter revealed an average read depth of 91× across exon 5 (Figure 1e), consistent with read alignment to multiple regions of the reference genome. Importantly, this also revealed reads carrying the heterozygous pathogenic variant.
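The effect of the mapping quality filter can be illustrated by counting the reads that overlap a region with and without a MAPQ threshold. This is a hedged sketch over plain-text SAM records: the function names, the toy coordinates, and the records themselves are assumptions for illustration and do not correspond to the SYNGAP1 locus or to the data in this study.

```python
import re

# CIGAR operations that consume reference bases (SAM specification).
REF_CONSUMING = re.compile(r"(\d+)([MDN=X])")


def reference_span(cigar):
    """Number of reference bases covered by an alignment with this CIGAR."""
    return sum(int(n) for n, _ in REF_CONSUMING.findall(cigar))


def count_overlapping_reads(sam_lines, chrom, start, end, mapq_gt=None):
    """Count mapped reads overlapping chrom:start-end (1-based, inclusive).

    If mapq_gt is given, only reads with MAPQ strictly greater than it count.
    """
    count = 0
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag, rname = int(fields[1]), fields[2]
        if flag & 0x4 or rname != chrom:  # unmapped, or on another contig
            continue
        pos, mapq = int(fields[3]), int(fields[4])
        aln_end = pos + reference_span(fields[5]) - 1
        if pos > end or aln_end < start:  # no overlap with the region
            continue
        if mapq_gt is not None and mapq <= mapq_gt:
            continue
        count += 1
    return count


# Toy records: two multi-mapped reads (MAPQ 0) and one uniquely mapped read.
toy_sam = [
    "r1\t0\tchr6\t100\t0\t100M\t*\t0\t0\t*\t*",
    "r2\t0\tchr6\t150\t0\t40M13I47M\t*\t0\t0\t*\t*",
    "r3\t0\tchr6\t120\t60\t100M\t*\t0\t0\t*\t*",
    "r4\t0\tchr6\t500\t60\t100M\t*\t0\t0\t*\t*",
]
unfiltered = count_overlapping_reads(toy_sam, "chr6", 180, 190)
filtered = count_overlapping_reads(toy_sam, "chr6", 180, 190, mapq_gt=20)
print(unfiltered, filtered)
```

A large gap between the unfiltered and filtered counts over a well-captured exon is the pattern described above, and it suggests that reads are being placed equally well at more than one location in the reference.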
The hg19 version of the reference genome from UCSC contains seven additional sequences at the 6p21.3 locus to capture the extensive genetic variation in the major histocompatibility complex (MHC) (Lam, Tay, Wang, Xiao, & Ren, 2015). SYNGAP1 is found centromeric to one of the common HLA haplotypes, A1-B8-DR3-DQ2 (Horton et al., 2008), represented by the chr6_ssto_hap7 contig (Figure 1f). Consequently, the sequencing reads from SYNGAP1 mapped perfectly to both chr6p21.3 and chr6_ssto_hap7, and their mapping quality scores were set to zero (Figure 1f), making these reads invisible to variant identification tools. The pathogenic variant was identified using default detection settings once the reads were re-aligned either to the GRCh37 reference genome, which lacks the additional MHC contigs, or to GRCh38 with an 'alt-aware' read aligner, bwa mem (v0.7.12), which assigned the correct mapping quality scores to the reads.
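One simple way to avoid the multi-mapping described above is to align against a reference that contains only the primary chromosomes. The sketch below filters FASTA records by name, assuming UCSC-style contig names (chr1 to chr22, chrX, chrY, chrM, as in the custom reference described earlier); the function name and the toy sequences are hypothetical, and a real reference would be streamed from disk rather than held in a list.

```python
# Primary contigs to retain, using UCSC-style names (an assumption of this sketch).
PRIMARY_CONTIGS = {f"chr{i}" for i in range(1, 23)} | {"chrX", "chrY", "chrM"}


def keep_primary_contigs(fasta_lines, keep=PRIMARY_CONTIGS):
    """Return only the FASTA lines belonging to records named in `keep`."""
    kept = []
    keeping = False
    for line in fasta_lines:
        if line.startswith(">"):
            # The record name is the first whitespace-delimited token of the header.
            parts = line[1:].split()
            keeping = bool(parts) and parts[0] in keep
        if keeping:
            kept.append(line)
    return kept


# Toy FASTA with one alternate MHC haplotype contig (sequences are placeholders).
toy_fasta = [
    ">chr6",
    "ACGTACGT",
    ">chr6_ssto_hap7",
    "ACGTACGT",
    ">chrM",
    "GGCC",
]
print(keep_primary_contigs(toy_fasta))
```

Dropping the alternate contigs discards the haplotype information they carry, so an alt-aware aligner with a reference that retains them, as described above for GRCh38, is the other option.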
In summary, we identified a shared heterozygous SYNGAP1 [...] resolve some regions of the genome previously classified as inaccessible to short-read sequencing, e.g., SMN1 and SMN2 (Feng et al., 2017).
Updating analysis pipelines has been shown to increase the diagnostic yield on systematic retrospective re-analyses (Wright et al., 2018).
In a recent multi-laboratory study of challenging variants, bioinformatic errors were a major cause of considerable inter-laboratory discordance, even among clinical laboratories (Lincoln et al., ). Our results suggest that updating the reference genome and aligner versions should be considered in any retrospective re-analysis of undiagnosed patient genome data. Additionally, alignment-free (Ostrander et al., 2018) or deep-learning-based variant calling methods (Poplin et al., 2018) may be considered as maximally orthogonal approaches for reanalyzing data. Initiatives such as the Broad Institute's "Functional Equivalence" specification, PrecisionFDA (Petrone, 2016), Genome in a Bottle (Zook, Catoe et al., 2016; Zook, Chapman et al., 2014) [...] poorly covered genes with potential pathogenic variation.
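When the same sample has been processed by two orthogonal pipelines, a first-pass comparison is to list the calls that appear in only one of the two call sets. The following is a minimal sketch under stated assumptions: both inputs are plain-text VCF records, indels have been normalized identically, and contig names match (hg19 and GRCh37 files often differ in the "chr" prefix). The function names and toy records are hypothetical.

```python
def variant_keys(vcf_lines):
    """Return a set of (chrom, pos, ref, alt) keys, one per alternate allele."""
    keys = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():  # skip headers and blanks
            continue
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        for allele in alt.split(","):  # split multi-allelic records
            keys.add((chrom, int(pos), ref, allele))
    return keys


def discordant_calls(vcf_a, vcf_b):
    """Return (calls only in A, calls only in B), each as a sorted list."""
    a, b = variant_keys(vcf_a), variant_keys(vcf_b)
    return sorted(a - b), sorted(b - a)


# Toy call sets (hypothetical positions and alleles).
pipeline_a = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr6\t1000\t.\tA\tG\t50\tPASS\t.",
]
pipeline_b = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr6\t1000\t.\tA\tG\t60\tPASS\t.",
    "chr6\t2000\t.\tC\tCTTTTTTTTTTTTT\t80\tPASS\t.",
]
only_a, only_b = discordant_calls(pipeline_a, pipeline_b)
print(only_a, only_b)
```

Calls found by only one pipeline are candidates for manual review of the underlying alignments rather than evidence that either pipeline is correct.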

ACKNOWLEDGMENTS
The authors thank the family for their participation in this study. The authors thank the Kinghorn Centre for Clinical Genomics for assistance with production and processing of whole genome sequencing data. The authors thank Marie-Jo Brion and Bronwyn Terrill for helpful suggestions for this manuscript. [...] wrote the manuscript and led the project.

ETHICAL COMPLIANCE
Genetic studies were approved by local ethics committees and written informed consent was obtained for molecular genetic analysis.