Detection of GBA missense mutations and other variants using the Oxford Nanopore MinION

Purpose Mutations in GBA cause Gaucher disease when biallelic, and are strong risk factors for Parkinson’s disease when heterozygous. GBA analysis is complicated by the nearby pseudogene. We aimed to design and validate a method for sequencing GBA on the Oxford Nanopore MinION. Methods We sequenced an 8.9 kb amplicon from DNA samples of 17 individuals, including patients with Parkinson’s and Gaucher disease, on older and current (R9.4) flow cells. These included samples with known mutations, assessed in a blinded fashion on the R9.4 data. We used NanoOK for quality metrics, two different aligners (Graphmap and NGMLR), Nanopolish and Sniffles to call variants, and Whatshap for phasing. Results We detected all known mutations, including the common p.N409S (N370S) and p.L483P (L444P), and three rarer ones, at the correct zygosity, as well as intronic SNPs. In a sample with the complex RecNciI allele, we detected an additional coding mutation, and a 55-base pair deletion. We confirmed compound heterozygosity where relevant. False positives were easily identified. Conclusion The Oxford Nanopore MinION can detect missense mutations and an exonic deletion in this difficult gene, with the added advantage of phasing and intronic analysis. It can be used as an efficient diagnostic tool.


Introduction
The GBA gene encodes the lysosomal enzyme Glucocerobrosidase, deficiency of which leads to accumulation of glucosylceramide. Biallelic (homozygous or compound heterozygous) mutations in GBA cause Gaucher disease (GD), the most common lysosomal storage disorder. 1 Heterozygous GBA mutations are a significant risk factor for Parkinson's disease (PD), 2,3 with evidence of longitudinal changes in many carriers suggestive of prodromal PD. 4 GBA mutations are also associated with Dementia with Lewy bodies 5 and Multiple System Atrophy, 6,7 related conditions which also demonstrate aggregation of the alpha-synuclein protein. At present, more than 300 mutations have been linked to Gaucher disease, 8 and the number of studies analysing the prevalence and phenotype of GBA mutations in PD is rapidly increasing. 9 10,11 12,13 GBA comprises eleven exons and ten introns over ~8 kb on chromosome 1q21. A nearby pseudogene GBAP has 96% exonic sequence homology to the GBA coding region. 14 The region also contains the Metaxin gene (MTX1), and its pseudogene.
The existence of these two pseudogenes confers an increased risk for recombination between homologous regions, which can generate complex alleles. The homology between GBA and GBAP is highest between exons 8-11, where most of the pathogenic mutations have been reported, usually resulting from recombination events. 8 The complex regional genomic structure complicates PCR and DNA sequencing, and some exons are also problematic in exome sequencing 15 and whole genome sequencing. 16 Established analysis protocols usually involve PCR of up to three 5 fragments, carefully designed to not amplify GBAP, 17 followed by Sanger sequencing of coding exons. Illumina targeted sequencing protocols have also recently been developed. 9,12 In recent years, long reads produced by Single Molecule Real Time (SMRT) sequencing have become commercially available, and have several advantages over short reads. 18 Oxford Nanopore sequencing technology analyses a single DNA molecule while it passes through a pore, producing characteristic changes in current depending on the sequence. 19 The Oxford Nanopore MinION is currently the most portable long-read sequencer. It can be plugged into a computer through a USB connection and provides sequencing data and run metrics data in real time. It has been used for applications ranging from pathogen sequencing in the field, 20 to sequencing a whole human genome. 21 It is still not routinely used in human disease diagnostics, but has been successfully used for SNV detection in CYP2D6, HLA-A and HLA-B, 22 TP53 in cancer 23 and BCR-ABL1 in leukemia, 24 and for chromosome 20 in a recent whole genome sequencing study. 21 In the present study, we present an efficient laboratory and bioinformatic protocol for GBA analysis using the MinION. In addition to disease-causing variants, it can detect intronic ones, and provide phasing information. The MinION protocol can thus provide further insights into GBA than other sequencing technologies, and is ready to be considered for diagnostic use.

Overview, DNA extraction and PCR
Samples used in this study were derived from 17 individuals (Table S1). We used brain DNA from eight PD patients, one MSA, and one control, including two samples from brains of PD patients known to carry GBA mutations RecNciI and p.L483P (L444P). 17 Brain samples were provided by Queen Square and Parkinson's UK brain banks. We also used samples from saliva of seven living individuals, six of whom had previously been found to have at least one mutation in GBA, although these results were not known to MLS and CP, who performed the SNV analyses, until it had been completed. All individuals had given informed consent, and ethics approval was provided by the local research ethics committee. DNA was isolated from brain using Phenol-Chloroform, 25 and from saliva using Oragene-DNA kit.
We amplified an 8.9-kb GBA sequence, which covered all coding exons, the introns between them, and part of the 3' UTR region (chr1: 155202296-155211206; Figure   S1). We customised previously reported primers 26 to carry Oxford Nanopore adapters and barcodes for multiplexing. Primer sequences were npGBA-F: 5'-TTTCTGTTGGTGCTGATATTGCTCCTAAAGTTGTCACCCATACATG-3' and npcMTX1: 5'- Two DNA polymerases with appropriate optimised PCR conditions were used to amplify the GBA target region (Table S2): Expand Long Template PCR (Roche) and Kapa Hi-Fi polymerase (Kapa Biosystems). Amplicons were purified by Qiaquick PCR purification kit (Qiagen) and DNA concentration was measured by Qubit.

Barcoding, Library Preparation and Sequencing
For sample multiplexing, a barcoding step was carried out after generating the GBA amplicons with PCR Barcoding Kit I (Oxford Nanopore). Up to 12 samples were pooled in each case for library preparation according to the manufacturer amplicon sequencing protocol, starting with 1 µg of DNA and 1% ƛDNA CS spike-in for the dAtailing step, followed by purification using AMPure beads. Nanopore adapters were ligated to the end-prepped DNA, using the NEB blunt/TA ligase master mix recommended by the manufacturer. Flow cell priming was performed according to the requirements of each flow cell version. We first used R7.3 and R9 flow cells with 2D reads, where a molecule passes through the pore in both directions. After recent technical advances, we used 1D reads from a R9.4 flow cell.

Bioinformatic analysis
MinKNOW versions 0.51.1.62 and later were used for data acquisition and run monitoring. Metrichor versions v2.38.1033 -v2.40.17 were used for basecalling, demultiplexing and fast5 file generation. The software divides reads into "pass" and "fail", and only "pass" reads were analysed. We used NanoOK (version 1.25) 27 to obtain a wide range of quality control metrics, with Graphmap (version 0.3.0) 22 alignment, using the precise region targeted as reference. We first converted fast5 files to fastq using NanoOK, or Poretools (version 0.6.0) 28 with a 2-kb size cut-off.
NanoOK output included the N50 (the size at which reads of the same or greater length contain 50% of the bases sequenced), the commonest erroneous substitutions, and overall error estimates, notably the aligned base identity excluding indels (ABID), and identical bases per 100 aligned bases including indels (IBAB). We 8 aligned reads to the human genome (hg19) for detailed study and variant calling using GraphMap or NGMLR (version 0.2.6) 29 , both specifically developed for long reads. Samtools (version 1.3.1) were used where required to merge, sort and index bam files. Coverage was calculated using bedtools (version 2.25.0) coverageBed.
Data were viewed on IGV.
We used Nanopolish (versions 0.6-dev and 0.8.4) 20 to call variants over our target region. Nanopolish was specially developed to improve accuracy by reanalysis of raw signals after alignment, and used in a recent whole genome study. 21 It relies on a hidden Markov model which calculates the probability of the MinION data at the signal-level for a given proposed sequence. 30 We called variants setting ploidy to 2, and invoked the "fix homopolymers" option. When using Nanopolish 0.8.4, we had to use Albacore (version 2.1.3, Oxford Nanopore) to re-generate fastq files for analysis.
We filtered any indel calls smaller than 5 bases, due to the known problem of Nanopore in calling these, especially in homopolymer regions. 21,29 We reviewed the variant quality of all calls and visualised them on IGV. We used WhatsHap, designed for long reads, 31 to phase all true variants, and tag bam files for visualisation. We used Sniffles, another tool designed specifically for such data, to call structural variants. 29 All nomenclature is based on the Human Genome Variation Society guidelines, 32 using reference sequence NM_000157.3. The traditional numbering for GBA missense mutations, which omits the first 39 amino acids, is given in brackets to ensure easy comparability with previous literature. SNVs were annotated using ANNOVAR, 33 and viewed on www.varsome.com, which provides data from dbSNP, gnomAD 34 genomes and exomes where available, and other useful metrics. 9

Statistical analysis
This was performed using Graphpad Prism v.6.0 (Graphpad, CA, USA) using paired t-test and Spearman correlation analysis as indicated.

Validation of correct alignment and preliminary mutation detection
To test our method, we first performed sequencing with 2D reads on brain DNA, acquired over five runs, using two earlier chemistry versions (R7.3 and R9), and the Graphmap aligner. The basic metrics of all are shown in Tables S3-S4. Nine samples had GBA coverage >60 over these runs, and were analysed further. We confirmed that reads mapped to the GBA target region, with only ~1.1% of reads aligning to pseudogene (Table S4). Two of these samples were known to have heterozygous GBA coding variants (three SNVs comprising the "RecNciI" allele in S5, and p.L483P, or L444P in S8). These were called by Nanopolish, and were clearly visible on IGV (Figure 1). One additional coding SNV was detected in S5 (g.155205518C>G; rs1064651; p.D448H, or D409H). This can occur in cis with the RecNciI allele. 35 There were non-coding variants in all samples. One apparently novel intronic variant in sample S1 (chr1:g.155207565C>T; intron 6: c.762-196G>A) was confirmed by Sanger sequencing ( Figure S2). It has now been reported as a very rare SNP in dbSNP build 150 (rs979955939; minor allele frequency 0.0001) and gnomAD genomes (1 of 30,762 alleles). Several other non-coding SNPs were detected (Table S4). One additional candidate, which was not a known SNP, was detected in S3. Review on IGV revealed that the same base change was present in all eight other samples, with AF 10-20% in five of them ( Figure S3). NanoOK showed that A>G is the second commonest erroneous substitution in this sample (15.94%).
Sanger sequencing did not confirm this variant, demonstrating that a common base error, with frequent reads supporting the same variant in most samples, is a false positive.

Use of newer chemistry to test detection in Gaucher patients
For the next part of the study, given the rapid improvements in Nanopore chemistry, and availability of the newer R9.4 cells, we decided to test samples known to carry pathogenic mutations, to determine the potential for diagnostic use. We used the Kapa PCR protocol, because of a possible minimal error reduction (Table S4). We included DNA from the two previously tested PD brain samples carrying RecNciI and p.L483P (S5 and S8), two additional untested PD cases (one brain and one saliva), and six saliva samples which had previously been determined to carry heterozygous or biallelic mutations. These comprised four GD patients and two carriers, although their status and previously established genotypes were not revealed until after the analysis was performed. We multiplexed these 10 samples on a R9.4 flow cell.
NanoOK analysis showed high base accuracy for all samples (mean 93.2%) (Tables   S3 and S5). We aligned data using both Graphmap, and the newly developed NGMLR, with a mean GBA coverage >300, and minimal number of reads aligning to the pseudogene (average 0.78% and 1.97% of the reads aligning to gene with Graphmap and NGMLR respectively; Table S5).

Coding mutations are detected
We called variants using Nanopolish (version 0.8.4) on data aligned both with GraphMap and NGMLR. We first focused on coding SNVs, which were detected in eight of the ten samples, regardless of the aligner used (Table 1; Figure 2). The two untested PD patients S16 and S18 were negative. We detected all previously known coding missense mutations, at the correct zygosity. These included p.N409S (N370S) in three GD cases, in the homozygous state in two (S12, S14), and heterozygous in one (S17) (Figure 2A), and the second mutation in S17 (p.L105P; Figure 2B). In another GD patient we detected two other heterozygous pathogenic mutations (p.R502C, p.R535C; Figure 2C,D). In the "RecNciI" carrier (S5), the additional p.D448H mutation was confirmed ( Figure 2F), which would lead to this allele being designated as "RecTL". We also detected heterozygosity in three samples from individuals without GD for p.L483P (L444P) (Figure 2E), including the one tested earlier. The mean quality score for coding heterozygous SNVs was 638 (standard deviation 229), and the lowest 337.8. The lowest scores were in S17, which had the second lowest coverage (205.1). There was a non-significant trend for coding heterozygote SNVs in a sample to have higher mean quality scores with higher coverage (Spearman correlation r=0.77, p=0.10, for the six samples which had at least one). The cut-off for a true positive may therefore partly depend on coverage.

Non-coding SNVs are also detected, and false positives can be identified
We reviewed all other SNV calls, and noted several known SNPs present in the heterozygous or homozygous state, with quality scores also >500 (Table S6). We also noted seven SNVs that were reported in one or (usually) several samples with low quality scores (all but one <200), all but one intronic (Table S6). These were always transitions (G>A, A>G, or C>T). These base changes were identified as common errors by NanoOK (occurring in 13.31%,12.66%, and 11.95% on average of the relevant base respectively). Furthermore, review of these positions on IGV in all samples revealed a high percentage of reads with the aberrant base, including those where the SNV was not called (11-31%; Figure S4). We concluded that these were false positives. Some were shared by Graphmap and NGMLR alignments from the 13 same sample. Overall, however, the NGMLR alignments had significantly fewer false positives, mostly due to one SNP that was always called in Graphmap samples, but never in NGMLR (mean per sample 2.2 with Graphmap, and 1.2 with NGMLR; paired t-test p=0.0038). For the mean quality of NGMLR false positives, there was no correlation with coverage (r=0.07, p=0.91, for the seven samples which had at least one).

Structural variant detection and mutation phasing provides additional relevant information
Sniffles and Nanopolish both reported a 55-bp exonic deletion in S5 in the NGMLRalignment only, clearly visible on IGV in this alignment ( Figure S5). This sample had been previously designated "RecNciI" based on the presence of three pseudogene derived missense changes which comprise this genotype. Our detection of the additional missense change p.D448H, and the 55-bp deletion, both of which may coexist with the "RecNciI" mutations, would change the classification to a "c.1263del+RecTL allele", indicating a different site of recombination with the pseudogene than RecNciI. 8 Detecting this deletion can be difficult with Illumina targeted sequencing. 9 No other structural variants were reported.
We next phased all variants using Whatshap (Table S7). We verified that the four coding SNVs and the deletion in S5 were in cis, as well as five rare intronic SNPs already detected in the original analysis ( Figure 3). We confirmed compound heterozygosity in two GD cases, S7, heterozygous for p.N409S and p.L105P, and S15, heterozygous for p.R502C and p.R535C. We noted a haplotype comprising 8 later extended to a 70-kb haplotype designated 111. 37 One sample was homozygous and two heterozygous for Pv1.1 + (Table S7)

Discussion
We have sequenced a long-range GBA amplicon, covering all coding exons and introns, using the Oxford Nanopore Technologies MinION. We first validated our approach on brain DNA samples, using the early R7.3 and R9 chemistry. We then used the newer R9.4 chemistry, as used in human whole genome sequencing, 21  sequencing technologies by Oxford Nanopore and PacBio which can easily generate long reads of several kb, 18 and in the case of the Nanopore up to hundreds of kb. 21 Using long reads has several advantages, despite the lower accuracy at the base level, 18 some of which were evident here. The challenge of aligning short reads to regions with high homology is often not fully appreciated, 15 with false negatives in GBA targeted Illumina sequencing when the whole genome was used as a reference. 9 We observed minimal alignment to the pseudogene. We also detected a coding 55-bp deletion, which can be missed by Illumina data, 9 and intronic SNPs, an understudied area in GBA and other lysosomal disorders. 9 Finally, the long reads allowed the phasing of mutations, enabling a haplotype-resolved personalized assessment. This helps overcome the frequent problem of phasing, which may require analysis of relatives. 38 The nanopore chemistry, and bioinformatic tools available, have evolved considerably during the time in which this work was performed. We compared two aligners (Graphmap and the recently developed NGMLR), both of which gave negligible alignment to the pseudogene. NGMLR allowed detection of the 55-bp deletion, and halved the number of false positives. We thus recommend using NanoOK for quality control, NGMLR for alignment, Nanopolish for SNV calling, and Sniffles for structural variant calling. Nanopolish has been designed for SNV calling by correcting accuracy problems arising in nanopore default basecalling by reanalysing the raw signal data. 21 Nanopolish variant calling option uses a likelihoodbased method to generate haplotypes that serve as the reference sequence for the target region. 20 It has been instrumental in projects ranging from Ebola virus 20 to human genome sequencing. 21 A cut-off quality score of 320 in our work differentiated all true and false positives, although the quality score of true positives may partly depend on coverage, and calls with scores ~200-400 would require careful review, and possible Sanger analysis. Although even 120x coverage allowed detection of a p.N409S homozygote, we recommend 300x or more to ensure accurate determination of zygosity. In a human genome sequencing study with SNP analysis of chromosome 20, coverage of only 30x remarkably allowed SNP calling with accuracy ~95% against annotated variants, but zygosity was not always correctly determined. 21 We were able to identify and filter false positives, based on (1) the low quality score on Nanopolish, (2) the high % of these changes occurring as errors based on NanoOK, and (3) the significant percentage of aberrant bases at the same positions in all samples, even where not called as mutations. Notably, they were always transitions, which were also the main errors in whole genome sequencing using the MinION. 21 Current limitations include the inability to accurately resolve homopolymers and detect small insertions and deletions (indels), 21,29 and we did not attempt to do this, filtering any indel calls <5 bases. Sniffles can detect insertions and deletions, as demonstrated here, as well as complex structural variants. 29 Based on the rapid developments in the chemistry and bioinformatics, we expect calling of small indels and further reduction of false positive SNV calls in the very near future.
As treatments are now available, neonatal screening for lysosomal storage diseases is becoming commoner, 39 including in some cases Gaucher. 40,41 This relies on biochemical activity, often by blood-spot screening, 42 with several false positives in Gaucher, possibly due to carrier status. 41 Genetic confirmation is ultimately required, so a rapid and cost-effective method would be useful in this setting. The advantages of the MinION include the very low capital cost, space requirements, and turnaround time of the analysis. The cost per sample is likely to compare favourably with Sanger and Illumina sequencing in all settings, especially taking into account the ability to phase variants and detect structural variants in this complex region. Current R9.4 flow cells yields are at least 5 Gb of sequence, and often much more. For our 8.9 kb amplicon, 96 samples, which can be multiplexed on a single flow cell, would therefore achieve a mean coverage >1,000x, even if less than a fifth of the reads aligned successfully.

Conclusion
Oxford Nanopore is a versatile single-molecule real-time sequencing technology that has been used in several innovative applications, from detection of Ebola to proof-of-principle human whole genome sequencing. Here we demonstrate that the MinION can detect and phase pathogenic variants in GBA, and intronic SNPs that would not be detected by Sanger sequencing of exons. The rapid evolution of specific bioinformatic methods, and the improvements in accuracy and data yield, combined with the minimal footprint and capital investment, make the MinION a suitable platform for long-read sequencing of difficult genes such as GBA, both in the diagnostic and research environments. The samples are separated into GD, where two mutations were expected, and others, where up to one was expected. * indicates sample also included in early 2D read work. The old aminoacid notation is included. Zygosity is shown for each mutation in each sample (het=heterozygous, hom=homozygous). The Nanopolish quality score is shown for the NGMLR aligned data.   The IGV trace over exon 10 is shown for all samples sequenced with 2D reads.  The deletion is clearly visible as a drop in coverage (red bracket). Reads are grouped and coloured by haplotype for these variants, which are all on the bluecoloured reads. The arrows point to the SNVs (red= coding, blue= non-coding) and the red box to the deletion.