Long‐read nanopore DNA sequencing can resolve complex intragenic duplication/deletion variants, providing information to enable preimplantation genetic diagnosis

Abstract

Background
The adoption of massively parallel short‐read DNA sequencing methods has greatly expanded the scope and availability of genetic testing for inherited diseases. Indeed, the power of these methods has encouraged the integration of whole genome sequencing, the most comprehensive single approach to genomic analysis, into clinical practice. Despite these advances, diagnostic techniques that incompletely resolve the precise molecular boundaries of pathogenic sequence variants continue to be routinely deployed. This can present a barrier for certain prenatal diagnostic approaches. For example, the pre‐referral workup for couples seeking preimplantation genetic diagnosis requires intragenic dosage variants to be characterised at nucleotide resolution.

Objective
We sought to assess the use of long‐read nanopore sequencing to rapidly characterise an apparent heterozygous RB1 exon 23 deletion that was initially identified by multiplex ligation‐dependent probe amplification (MLPA), in a patient with bilateral retinoblastoma.

Methods
Target enrichment was performed by long‐range polymerase chain reaction (PCR) amplification prior to Flongle sequencing on a MinION long‐read sequencer.

Results
Characterisation of the deletion breakpoint included an unexpected 85‐bp insertion which duplicated RB1 exon 24 (and was undetected by MLPA). The long‐read sequence permitted design of a multiplex PCR assay, which confirmed that the mutation arose de novo.

Conclusion
Our experience demonstrates the diagnostic utility of long‐read technology for the precise characterisation of structural variants, and highlights how this technology can be efficiently deployed to enable onward referral to reproductive medicine services.


| INTRODUCTION
Approximately 40% of cases of retinoblastoma occur bilaterally, a finding which is indicative of heritable autosomal dominant susceptibility, attributable to a germline loss-of-function mutation in the RB1 gene (OMIM: 180200). Since individuals with heritable retinoblastoma are also at increased risk of developing non-ocular tumours, establishing a molecular diagnosis and evaluating at-risk family members are of critical importance.
Sequence-based analyses of the RB1 gene, traditionally performed by Sanger sequencing of the coding exons and immediate flanking regions, identify single-nucleotide or small insertion-deletion variants that account for the majority (80%-85%) of pathogenic variants. 1 The remaining heritable variants are typically either intragenic deletion/duplication events (discovered using quantitative polymerase chain reaction [PCR] or multiplex ligation-dependent probe amplification [MLPA®]), or larger deletions that span the 13q14 locus (detected by chromosomal microarray). This latter group of patients may also show additional developmental delay and birth defects. 2 A common feature of the techniques that are routinely deployed to identify dosage variants is their inability to determine the precise genomic boundaries of the molecular event and, in the case of copy number gains, its orientation. Although the approximate minimum and maximum size of copy number variants can be estimated, the resolution is variable, depending on assay factors such as the density and locations of adjacent probes.
Over the last decade or more, the molecular diagnosis of rare genetic disorders has been revolutionised by massively parallel "next-generation" sequencing (NGS) methods. In particular, hybridisation-based target enrichment, combined with short-read sequencing, has become a predominant molecular diagnostic approach. Not only single nucleotide or small insertion-deletion variants, but also whole exon deletion or duplication events, can be identified from a single such dataset, although the detection of the latter copy-number variants requires the use of different informatics pipelines. This has expanded the scope of molecular investigations to genes that were not typically targeted by "off-the-shelf" reagents, and has extended the mutation spectrum of many rare disorders. 3 While assay sensitivity for these comparative read-depth approaches is affected by both the underlying genomic architecture of the targeted locus and the mean depth of sequencing for a given sample, there has nevertheless been considerable enthusiasm to incorporate these methodologies into clinical practice. 4 Most recently, it has become feasible to deploy whole genome sequencing (WGS) for diagnostic purposes, offering greater opportunities to directly characterise structural variants at nucleotide resolution. However, short-read sequencing technologies have limited capabilities for this purpose, largely because of their inability to generate unambiguous alignments spanning low-complexity repeat elements. Such repetitive regions are frequent sites of the breakpoints for deletions and duplications that arise due to non-allelic homologous recombination. In addition, WGS remains an expensive diagnostic approach when a sequence variant is already partially defined. By contrast, "third-generation" single molecule sequencers can generate long sequence reads that unambiguously define structural variants by spanning low-complexity regions.
These instruments have therefore been used for both the targeted follow-up of complex alleles 5,6 and structural variant discovery. 7 Here, we describe the use of a low-throughput long-read nanopore device, the "Flongle", to delineate the molecular breakpoint of an apparent heterozygous RB1 exon 23 deletion, at nucleotide resolution. We assess the accuracy of the nanopore platform and highlight the importance of retrospectively characterising incompletely defined sequence variants. Analysis of the presented case enabled onward referral to a national preimplantation genetic diagnosis service; such scenarios are likely to be of increasing clinical importance.

| MATERIALS AND METHODS
An 8-month old male infant presenting with bilateral retinoblastoma was referred for molecular genetic analysis of the RB1 gene.
Following written consent, DNA was isolated from peripheral blood lymphocytes of the proband and his relatives using the Puregene standard salting out procedure (Qiagen GmbH
NanoFilt v.2.20 was used to filter low-quality reads (Q < 10) and select those within a 7500–8000 bp size range (https://github.com/wdecoster/nanofilt). 9 Processed reads were aligned to an indexed human reference genome (build hg19) using minimap2 v.2.16 (https://github.com/lh3/minimap2). 10 SAM-to-BAM file conversion, BAM file indexing and read sorting by genomic coordinate were performed using samtools v.1.9 (http://www.htslib.org/). 11 In view of the excessive read-depth generated by the full dataset, the resulting BAM file was downsampled to 10% of the total read count (samtools v.1.9). A consensus de novo assembly of the variant-containing allele was generated from all available sequence reads using Canu v.2.1.1 (https://github.com/marbl/canu/). 12 This was analysed by pairwise comparison to the human reference genome, defined by the long-range PCR amplicon, using the Needleman-Wunsch algorithm (https://www.ebi.ac.uk/Tools/psa/emboss_needle/). 13 The BLAST-like alignment tool (BLAT) was used to determine the genomic coordinates of the inserted sequence (http://genome.ucsc.edu/cgi-bin/hgBlat). 14
To verify and more closely delineate the exon 23 deletion, a long-range PCR amplicon encompassing adjacent MLPA probes was optimised (Supplementary Figure 1) and analysed by long-read nanopore sequencing. Summary run metrics are included in Supplementary Table 1. To select reads specific to the variant-containing allele, adaptor-trimmed sequences were filtered by read length.
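The read-selection step described above (removal of reads with Q < 10 and retention of reads between 7500 and 8000 bp) can be illustrated with a minimal Python sketch. This is not the NanoFilt implementation; the `FastqRecord` type and the function names are hypothetical, and the mean read quality is computed here by averaging per-base error probabilities, which is one common convention and is assumed rather than taken from the study.

```python
import math
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    """Hypothetical container for one read; quality is a Phred+33 string."""
    name: str
    sequence: str
    quality: str


def mean_read_quality(quality: str) -> float:
    """Average the per-base error probabilities, then convert back to Phred.

    Assumption: this convention is used for illustration only.
    """
    if not quality:
        return 0.0
    error_probs = [10 ** (-(ord(ch) - 33) / 10) for ch in quality]
    return -10 * math.log10(sum(error_probs) / len(error_probs))


def filter_reads(
    records: Iterable[FastqRecord],
    min_quality: float = 10.0,   # Q < 10 reads are discarded
    min_length: int = 7500,      # lower bound of the size window (bp)
    max_length: int = 8000,      # upper bound of the size window (bp)
) -> Iterator[FastqRecord]:
    """Yield reads that pass both the length window and the quality threshold."""
    for record in records:
        if not (min_length <= len(record.sequence) <= max_length):
            continue
        if mean_read_quality(record.quality) < min_quality:
            continue
        yield record


if __name__ == "__main__":
    reads = [
        FastqRecord("good", "A" * 7600, "5" * 7600),   # Q20, inside window
        FastqRecord("short", "A" * 3000, "5" * 3000),  # Q20, too short
        FastqRecord("lowq", "A" * 7600, "%" * 7600),   # Q4, inside window
    ]
    print([r.name for r in filter_reads(reads)])
```

Applying the length window before the quality check avoids computing a quality score for reads that would be discarded on size alone.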
Visual inspection of these data, following alignment to the human reference genome, revealed the boundaries of the deleted sequence +3835delins (2490-46_2520+4;ATGA), was assigned as "pathogenic" following interpretation according to the Association for Clinical Genomic Science best practice guidelines. 16 The variant was not identified on in-house or locus-specific databases (http://RB1.variome.org).
To appraise the value of individual raw nanopore reads, we extracted those with the highest mean basecall quality score.
Pairwise alignment between these individual reads and a Sanger-verified curated reference, which included the variant site and adjacent sequence, yielded a maximum identity score of 98.6% (Supplementary Table 2). In addition to the mutation we describe, 23 pairwise mismatches were identified between the Canu assembly and hg19 reference sequence. Of these, 13 mismatches were located in a poly(N) region and 10 were not. Twenty-one mismatches were Sanger-validated; at 10 locations, the Canu-generated assembly was correct (Supplementary Figure 3). The remaining 11 mismatch sites were all located within poly(N) tracts (that varied in length between three and eight nucleotides); at each of these sites, the Canu-generated assembly underestimated the tract length by a single nucleotide.
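To make the identity score concrete, the following Python sketch summarises a gapped pairwise alignment as counts of matches, mismatches and gap columns. It assumes that percentage identity is the number of matching columns divided by the total alignment length; the function and type names are hypothetical, and the example sequences are invented rather than taken from the study.

```python
from typing import NamedTuple


class AlignmentSummary(NamedTuple):
    matches: int
    mismatches: int
    gaps: int
    identity_percent: float


def summarise_alignment(aligned_a: str, aligned_b: str, gap: str = "-") -> AlignmentSummary:
    """Count matches, mismatches and gap columns in two gapped, aligned strings.

    Assumption: identity = matching columns / alignment length (gaps included).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("alignment is empty")

    matches = mismatches = gaps = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            gaps += 1          # insertion or deletion column
        elif a == b:
            matches += 1
        else:
            mismatches += 1    # substitution column

    identity = 100.0 * matches / len(aligned_a)
    return AlignmentSummary(matches, mismatches, gaps, identity)


if __name__ == "__main__":
    # Invented toy alignment: one gap column and one substitution.
    print(summarise_alignment("ACGTACGTAC", "ACGTAC-TAT"))
```

Other conventions exclude gap columns from the denominator, so identity values are only comparable when the same definition is used throughout.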
A multiplex PCR was subsequently optimised, incorporating a normal allele-specific reverse primer to work in conjunction with the variant allele-specific reverse primer; this provides a single assay for genotyping the indel in at-risk individuals. The assay demonstrated that the variant had arisen de novo in the proband (it was absent in both parents), and the mutation was also not detected in his sister (Figure 3).

| DISCUSSION
The adoption of WGS into routine clinical practice is enabling the complete characterisation of some structural variants, at nucleotide resolution, from a single assay. 17
To simplify the task of correctly assembling the sequences of both alleles from a mixed pool of normal and variant-containing reads, in silico size selection was performed. We next undertook de novo assembly of the quality-filtered read set (to overcome the increased error rate that is intrinsic to nanopore-generated sequence reads) to produce a single consensus sequence for interrogation.
Pairwise comparison against a curated benchmark sequence (generated following Sanger sequencing) confirmed complete concordance between these sequences, verifying the validity of our long-read de novo assembly pipeline. We note that the lengths of poly(N) tracts were systematically underestimated by a single nucleotide. The accuracy of the assembly is therefore likely to be dependent on the genomic architecture of the sequenced region. From an end-user perspective, the interpretation of a single consensus sequence, in combination with BLAT, proved much simpler than visual inspection of a Sanger sequencing chromatogram.
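The poly(N) observation can be illustrated with a small Python sketch that run-length encodes two sequences and reports homopolymer tracts whose lengths disagree. This is a simplified illustration, not part of the published workflow: it assumes the two sequences differ only in homopolymer length, the names are hypothetical, and the example sequences are invented.

```python
from itertools import groupby
from typing import List, Tuple


def run_length_encode(sequence: str) -> List[Tuple[str, int]]:
    """Collapse a sequence into (base, run length) pairs."""
    return [(base, sum(1 for _ in run)) for base, run in groupby(sequence.upper())]


def homopolymer_length_differences(
    reference: str, assembly: str, min_run: int = 3
) -> List[Tuple[int, str, int, int]]:
    """Report reference tracts of at least min_run bases whose length differs.

    Returns (reference start, base, reference length, assembly length) tuples.
    Assumption: the sequences are identical apart from homopolymer lengths.
    """
    ref_runs = run_length_encode(reference)
    asm_runs = run_length_encode(assembly)
    if [base for base, _ in ref_runs] != [base for base, _ in asm_runs]:
        raise ValueError("sequences differ by more than homopolymer length")

    differences = []
    ref_pos = 0  # 0-based start of the current run in the reference
    for (base, ref_len), (_, asm_len) in zip(ref_runs, asm_runs):
        if ref_len >= min_run and ref_len != asm_len:
            differences.append((ref_pos, base, ref_len, asm_len))
        ref_pos += ref_len
    return differences


if __name__ == "__main__":
    # Invented example: a five-base T tract reported as four bases.
    print(homopolymer_length_differences("ACGTTTTTACCCGA", "ACGTTTTACCCGA"))
```

For real data the comparison would need to start from a proper pairwise alignment, because substitutions or larger indels break the assumption that both sequences share the same order of runs.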
By characterising the variant breakpoint, it was possible to design a simple assay for familial diagnostic testing. This established that the variant arose de novo in the proband. The proband's unaffected sister was shown not to have inherited the variant, a clinically important finding, since the possibility of parental germline mosaicism could not be eliminated; this was estimated to be ∼2.5% prior to testing. 24 While the variant we describe was not identified in population or disease-specific databases, the breakpoint assay could be used to screen cohorts of RB1-mutation negative retinoblastoma patients to establish the precise prevalence of the mutation.
Our experience demonstrates how nanopore-based sequencing, using Flongle flowcells, could be routinely deployed for follow-up analysis of previously identified, but incompletely defined, structural variants. Due to high sequence yield, our analysis used a downsampled dataset comprising 10% of the aligned reads. This suggests that the workflow could be readily adapted to allow concurrent analysis of multiple patient libraries, increasing laboratory throughput and reducing the per-patient assay cost. Furthermore, as nanopore reads can be analysed in real time, runs could potentially be terminated once sufficient data have been accumulated, leading to an overall reduction in test turnaround times.
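As an illustration of downsampling to a fixed fraction, the Python sketch below keeps a read when a hash of its name falls below the requested fraction, so the same reads are retained on every run. It is a sketch under stated assumptions and not a reimplementation of samtools; the function names, the `seed` parameter and the read names are hypothetical.

```python
import hashlib
from typing import Iterable, List


def keep_read(read_name: str, fraction: float, seed: int = 0) -> bool:
    """Map the read name to a reproducible value in [0, 1) and compare it to fraction."""
    digest = hashlib.sha256(f"{seed}:{read_name}".encode("utf-8")).digest()
    # Use 48 bits so the division is exact and the value is strictly below 1.
    value = int.from_bytes(digest[:6], "big") / 2**48
    return value < fraction


def downsample(read_names: Iterable[str], fraction: float = 0.10, seed: int = 0) -> List[str]:
    """Return roughly `fraction` of the reads, chosen deterministically by name."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    return [name for name in read_names if keep_read(name, fraction, seed)]


if __name__ == "__main__":
    names = [f"read_{i}" for i in range(10000)]  # invented read names
    subset = downsample(names, fraction=0.10)
    print(f"kept {len(subset)} of {len(names)} reads")
```

Because the decision depends only on the read name and the seed, repeating the analysis reproduces the same subset, and changing the seed draws a different one.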
In summary, we report how complete characterisation of a pathogenic dosage variant can enable onward referral to a national preimplantation genetic diagnostic service. This was efficiently achieved using a facile long-read workflow.

ACKNOWLEDGEMENT
The authors have no support or funding to report.

CONFLICT OF INTEREST
Dr Watson has received travel expenses to speak at an ONT organized conference.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available from the corresponding author upon reasonable request.