A comprehensive toolkit to enable MinION sequencing in any laboratory

Long-read sequencing technologies are transforming our ability to assemble highly complex genomes. Realising their full potential relies crucially on extracting high quality, high molecular weight (HMW) DNA from the organisms of interest. This is especially the case for the portable MinION sequencer which, owing to its low entry cost and minimal spatial footprint, enables any laboratory to undertake its own genome sequencing projects. One challenge of the MinION is that each group has to independently establish effective protocols for using the instrument, which can be time-consuming and costly. Here we present a workflow and protocols that enabled us to establish MinION sequencing in our own laboratories, based on optimising DNA extractions from a challenging plant tissue as a case study. Following the workflow illustrated, we were able to reliably and repeatedly obtain > 8.5 Gb of long-read sequencing data with a mean read length of 13 kb and an N50 of 26 kb. Our protocols are open-source and can be performed in any laboratory without special equipment. We also illustrate some more elaborate workflows which can increase mean and median read lengths if this is desired. We envision that our workflow for establishing MinION sequencing, including the illustration of potential pitfalls, will be useful to others who plan to establish long-read sequencing in their own laboratories.

[...] and 27, Table 1) had close-to-optimal ratios of 2.1 and 2.3, respectively. The sample with the low 260/230 nm ratio yielded an order of magnitude less sequence data from a single flowcell than the other two samples (0.7 Gb vs. ~7 Gb, respectively [...]

[...] we recommend adhering to the DNA quality measures nominated above whenever possible, or else expecting reduced sequencing outputs. We also advise establishing suitable DNA [...] suggests that optimizing DNA extraction protocols can take several months. The manufacturer-recommended library preparations involving DNA repair and end-prep are [...]

[...] intensity (e.g. after staining with SYBR red or ethidium bromide) it is important to consider that smaller DNA molecules incorporate less dye and so appear fainter during imaging. For example, even faint DNA smears below 10 kb can indicate a significant presence of short DNA fragments, which are best avoided if long read lengths are a primary goal of the sequencing effort (see below). Failure to account for this can easily lead to overestimation of the mean DNA fragment length and to miscalculation of the true concentration of DNA fragments.

As a starting point we defined the optimal DNA input based on our initial mean fragment length estimate of 30 kb. This was followed by empirical adjustments from plotting sequencing outputs vs. the DNA input into adapter ligation (Figure 4). This approach revealed an optimum of ~2 µg dsDNA (Figure 4), which required an input of 2.9 µg DNA for the DNA preparation stage, considering typical losses of 30% after clean-up using in-house SPRI beads (see below). Neither decreasing nor increasing the DNA input improved the sequencing output, owing to too few adapter-ligated DNA molecules or to too many free DNA molecules damaging the pores, respectively. Assuming that 2.9 µg input DNA was the equivalent of 0.2 pmol, we estimate a mean DNA fragment length of 23 kb for our sample preparation.
This suggests we initially overestimated the mean DNA fragment length, and highlights the difficulty of estimating these values based on gel imaging. At the same time, we stress that it is best to establish optimal DNA inputs empirically for each DNA extraction and/or shearing protocol. In addition, one can use the sequence read-length distribution from the initial flowcells to improve the estimate of the mean fragment length of the DNA extracted from the tissue. This approach can help to quickly optimise the amount of input DNA added to the ligation step.

[...] length. Optimizing these parameters is very important when optimizing DNA fragment length. To demonstrate this effect, we compared DNA fragment length with sequencing read lengths between two sets of samples that were subjected to different tissue homogenisation procedures. Our standard tissue homogenisation method for eucalyptus leaves consisted of crushing frozen samples for 35 seconds with two 5 mm metal beads in a Qiagen TissueLyser at 24 Hz. To maintain the frozen state, each Eppendorf tube as well as the grinding rack was frozen in liquid nitrogen before the homogenisation step. In an attempt to improve throughput, we tested the effect of homogenizing samples in larger batches, which likely led to a situation where not all samples were completely frozen throughout the procedure. This [...] band. This suggests that the average DNA fragment length was reduced in this sample. To [...] read length distributions; the mean read length dropped from ~13 kb to 4.9 kb, and the median from ~7 kb to 2.5 kb (Table 2). This illustrates that even a slight change in DNA smearing can have a huge impact on sequencing output. It is therefore important to carefully assess DNA fragment length by agarose gel or PFGE, if possible by comparison to other samples, to avoid short sequence reads.
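
The back-calculation of mean fragment length from DNA mass and molar amount, and the allowance for clean-up losses, can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure used in the study: the function names are hypothetical, and the average mass of ~650 g/mol per base pair of dsDNA is a common rule of thumb that is assumed here and not stated in the text.

```python
# ASSUMPTION: average molar mass of one base pair of dsDNA (g/mol).
# ~650 g/mol per bp is a widely used approximation, not a value from the text.
GRAMS_PER_MOL_PER_BP = 650.0


def mean_fragment_length_bp(mass_ug, amount_pmol, g_per_mol_bp=GRAMS_PER_MOL_PER_BP):
    """Mean fragment length (bp) implied by a DNA mass and a molar amount."""
    grams = mass_ug * 1e-6
    moles = amount_pmol * 1e-12
    return grams / moles / g_per_mol_bp


def mass_for_molar_target_ug(amount_pmol, mean_length_bp, g_per_mol_bp=GRAMS_PER_MOL_PER_BP):
    """DNA mass (µg) needed to supply a molar amount at a given mean fragment length."""
    return amount_pmol * 1e-12 * mean_length_bp * g_per_mol_bp * 1e6


def input_mass_for_target_ug(target_ug, fractional_loss):
    """Starting mass (µg) needed to retain target_ug after a clean-up step."""
    if not 0.0 <= fractional_loss < 1.0:
        raise ValueError("fractional_loss must be in [0, 1)")
    return target_ug / (1.0 - fractional_loss)


if __name__ == "__main__":
    # 2.9 µg treated as 0.2 pmol, as in the text.
    print(round(mean_fragment_length_bp(2.9, 0.2)))         # mean length in bp
    # Mass corresponding to 0.2 pmol at the initial 30 kb estimate.
    print(round(mass_for_molar_target_ug(0.2, 30_000), 2))  # µg
    # ~2 µg needed at ligation with a 30% bead clean-up loss.
    print(round(input_mass_for_target_ug(2.0, 0.30), 2))    # µg
```

Under the assumed 650 g/mol per bp, 2.9 µg at 0.2 pmol corresponds to roughly 22 kb, close to the 23 kb quoted above; the exact figure depends on the per-base-pair mass used. The same arithmetic shows why an initial 30 kb estimate implies a larger input mass for the same molar amount.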

Because our focus for this project was on generating reads >5 kb to assemble a repeat-rich genome de novo, we reasoned that it would be beneficial to remove smaller DNA fragments (<1-2 kb) from all samples. AMPure XP beads can be used to size-select DNA fragments in [...] PEG and NaCl concentrations, which precipitate DNA in a cooperative manner, we might be able to select a higher average DNA fragment length, and thereby remove unwanted smaller DNA fragments (Lis & Schleif, 1975; Ramos, de Vries, & Ruggiero Neto, 2005). Using 0.8 [...]

We next assessed the effect of DNA shearing and gel-based size-selection procedures on sequencing throughput and read length distribution. In the case of DNA shearing, our hypothesis was that a more unimodal size distribution of shorter DNA fragments with a peak of about 20 kb (Figure 3) would increase sequencing throughput. We used g-TUBEs with a benchtop centrifuge to shear DNA by forcing it through a µm mesh. DNA shearing did not increase yield, but did affect the read length distribution (Table 3). Compared with non-sheared samples, the sequence read length distribution from sheared samples shifted to an N50 [...] Q7 of ~26 kb from the unsheared samples (Table 3). Here, Q7 represents the default quality threshold of the basecaller. Interestingly, the median read length from the sheared DNA samples increased to 7.5 kb, from 6.5 kb for unsheared DNA. At the same time, low quality short reads were reduced in the sheared samples. Possibly, removing long DNA fragments (> 50 kb) leads to fewer low quality reads caused by long DNA molecules being stuck in the pore, at least when using the R9.5 pore chemistry. This highlights that filtering reads based on their q-scores, as well as removing short sequencing reads, may help to avoid error propagation during downstream analyses of the data.
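
The read-length summaries discussed above (N50, mean and median after applying a Q7 filter) can be reproduced with a short Python sketch. It is only an illustration under stated assumptions: the function names are hypothetical, reads are assumed to be available as (length, mean Q-score) pairs, and how the per-read mean quality is derived from the basecaller output is left outside the sketch.

```python
from statistics import mean, median


def n50(lengths):
    """Length L such that reads of length >= L contain at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


def summarise_reads(reads, min_q=7.0, min_len=0):
    """Summarise (length, mean_q) pairs after quality and length filtering.

    min_q=7.0 mirrors the Q7 threshold mentioned in the text;
    min_len is an optional, hypothetical short-read cut-off.
    """
    kept = [length for length, q in reads if q >= min_q and length >= min_len]
    if not kept:
        return None
    return {
        "n_reads": len(kept),
        "total_bases": sum(kept),
        "mean": mean(kept),
        "median": median(kept),
        "n50": n50(kept),
    }


if __name__ == "__main__":
    # Toy data for illustration only; not values from the study.
    toy_reads = [(2000, 9.1), (4000, 6.2), (6000, 8.0), (10000, 10.5), (30000, 7.4)]
    print(summarise_reads(toy_reads))
```

Because N50 is weighted by bases rather than by reads, a few very long reads can raise it well above the median, which is why the text reports N50, mean and median side by side.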

We also tested the effect of removing DNA fragments below 20 kb by size selection using the BluePippin system in the high-pass mode, which enables the collection of DNA molecules above a certain size. When we applied the 20 kb high-pass filter we were able to remove DNA fragments shorter than 20 kb while maintaining the high molecular weight size distribution [...] high sample loss (65-75%), the increase in cost, and prolonged sample handling.

Overall, we did not employ DNA shearing using g-TUBEs or size selection via BluePippin in our final sequencing protocol, even though they improved sequencing read-length distributions. In our case, the high quality sequencing results achieved with our standard protocol using the improved SPRI beads mixture did not warrant the additional time and [...]

[...] preparation and is predictive of a high final sequencing output. If the pore occupancy was below 70% we stopped the run, washed the flowcell and loaded a new library to ensure high throughput per flowcell (Figure 8). We reasoned that these low throughput runs were usually due to insufficient DNA molecules being ligated to sequencing adapters during the library preparation. We found that we had to load at least 1 µg library DNA onto a flowcell to achieve acceptable yields (Figure 4). To ensure sufficient adapter-ligated DNA, we started library preparation with at least 4 µg of high quality starting DNA to account for potential [...]

[...] run it is likely that the library is contaminated and the pores are being irreversibly blocked or damaged, or that the membrane has ruptured. If recognized early enough the flowcell can be washed and a new library loaded, but the pores cannot always be recovered. Furthermore, the length distribution from the length graph can be easily assessed and, if unsatisfactory, the library exchanged for a separately prepared sample (Figure 8).
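
The 70% pore-occupancy rule described above can be written down as a small check. This is a sketch under explicit assumptions: the text gives only the threshold, so the definition of occupancy used here (pores currently sequencing divided by pores that are either sequencing or open and available) and all function and parameter names are illustrative, not taken from the instrument software.

```python
OCCUPANCY_THRESHOLD = 0.70  # threshold stated in the text


def pore_occupancy(sequencing_pores, available_pores):
    """Fraction of usable pores that currently hold a DNA strand.

    ASSUMPTION: occupancy = sequencing / (sequencing + available).
    """
    usable = sequencing_pores + available_pores
    if usable <= 0:
        raise ValueError("no usable pores reported")
    return sequencing_pores / usable


def should_wash_and_reload(sequencing_pores, available_pores,
                           threshold=OCCUPANCY_THRESHOLD):
    """True if occupancy falls below the threshold used to stop a run early."""
    return pore_occupancy(sequencing_pores, available_pores) < threshold


if __name__ == "__main__":
    # Hypothetical pore counts read off the run display early in a run.
    print(should_wash_and_reload(300, 200))  # occupancy 0.60
    print(should_wash_and_reload(400, 100))  # occupancy 0.80
```

In practice the counts would be read from the run monitoring display shortly after loading; the sketch only formalises the stop-and-reload decision.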
We also recommend tracking the sequencing run with a continuous screenshot application (e.g. newlapse for Linux, https://github.com/mtib/newlapse), in addition to visual inspection during the first few hours [...]

One key to ongoing optimisation of flowcells in our laboratories was the tracking of all parameters for each sequencing run using our monitoring spreadsheet (Supplemental Table 2 [...] (https://github.com/roblanf/minion_qc). After basecalling each flowcell, we ran this script and examined in detail the length and mean quality distributions of the reads, and the physical performance map of the flowcell. This allowed us to continually evaluate and improve our protocols for each flowcell. Before we were halfway through our project, we were able to reliably and repeatedly obtain more than 6 Gb of data from each flowcell, with mean read lengths consistently above 12 kb.

Here we present a complete workflow to establish MinION long-read sequencing in any laboratory. We highlight the importance of clean high molecular weight DNA for successful sequencing runs and provide detailed wet lab DNA extraction and purification protocols that include size selection. All these protocols and many others applicable to different starting [...]

We would like to acknowledge fruitful discussion, leading to and improving this manuscript, [...] MinION usergroup on protocols.io for sharing their protocols openly.

Eucalyptus pauciflora leaf tissue was collected from Thredbo, New South Wales (NSW), Australia. After harvesting, the young twigs were transported in plastic bags and stored in darkness at 4°C in water until DNA extraction. [...] out with 800-1000 mg leaf tissue, which was cut into small pieces and split between 8 separate 2 mL Eppendorf tubes, each containing 2 metal beads of 5 mm in diameter, before freezing in liquid nitrogen.
We lysed the tissue mechanically by grinding using the Qiagen [...] to 64°C for 30 minutes to inactivate DNases. One µL RNase A (10 mg/mL) (Thermo Fisher) per 1 mL lysis buffer was added to the mixture after the heat treatment, followed by [...] proteins were removed by centrifugation at 8000 g for 12 min at 4°C. We transferred the supernatants to new tubes and purified the DNA from solution as described below in "Removal of small DNA fragments < 1.5 kb with optimized SPRI beads".

We further purified the samples using a chloroform:isoamylalcohol extraction. The eight aqueous DNA solutions were pooled to a total of 400 µL, to which one volume of chloroform:isoamylalcohol was added, mixing by inversion for 5 minutes. The phases were separated by centrifugation at 5000 g for 2 minutes at room temperature (RT). We [...] reduce DNA yields but potentially precipitates longer fragments in preference to shorter fragments.

The transparent pellet was washed with 70% ethanol and resuspended in 50 µL 10 mM Tris- [...] preparation, for a maximum of 10 days.

We processed 5 µg of pure HMW DNA through a g-TUBE (Covaris) in an Eppendorf 5418 centrifuge at 3800 rpm for a total of 2 minutes. [...]

Read length comparison for samples sheared during the extraction process.

Comparison of N50 Q7, mean read length Q7 and median read length Q7 between untreated [...]

[...] read-length Q7 and median read-length Q7 of untreated samples (10) and (27) and BluePippin size-selected samples (2) (Figure 3).

[...] default HMW DNA extraction protocol with mean read length of 13 kb as shown in Table 2. [...] extraction protocol (mean read length of 13 kb as shown in Table 4). Lane #3: same DNA extraction as in #2 followed by size selection with the BluePippin using 20 kb high-pass filtering (a mean read length of 26 kb as shown in Table 4). Lane #4: same DNA extraction as in #2 followed by mechanical shearing with a g-TUBE (a mean read length of 11.8 kb as shown in Table 3).

[...] available DNA sequencing adapters in the ligation reaction. The red points mark outliers with low yields due to a broken flowcell membrane or to bad quality DNA input (Table 1). These points were excluded from the calculation of the smoothed line.

[...] shows the same data, but with reads that have a mean quality (Q) score less than 7 removed.

[...] that only a small proportion of the sequenced bases were in reads shorter than 20 kb), we were able to obtain higher yields of reads >20 kb from the libraries that were prepared [...] contained a considerable yield of reads between 1 kb and 20 kb, which may be useful for many applications.