Detailed analysis of HTT repeat elements in human blood using targeted amplification‐free long‐read sequencing

Abstract Amplification of DNA is required as a mandatory step during library preparation in most targeted sequencing protocols. This can be a critical limitation when targeting regions that are highly repetitive or with extreme guanine–cytosine (GC) content, including repeat expansions associated with human disease. Here, we used an amplification‐free protocol for targeted enrichment utilizing the CRISPR/Cas9 system (No‐Amp Targeted sequencing) in combination with single molecule, real‐time (SMRT) sequencing for studying repeat elements in the huntingtin (HTT) gene, where an expanded CAG repeat is causative for Huntington disease. We also developed a robust data analysis pipeline for repeat element analysis that is independent of alignment of reads to a reference genome. The method was applied to 11 diagnostic blood samples, and for all 22 alleles the resulting CAG repeat count agreed with previous results based on fragment analysis. The amplification‐free protocol also allowed for studying somatic variability of repeat elements in our samples, without the interference of PCR stutter. In summary, with No‐Amp Targeted sequencing in combination with our analysis pipeline, we could accurately study repeat elements that are difficult to investigate using PCR‐based methods.


INTRODUCTION
Sequencing of long stretches of repeated nucleotides is notoriously difficult and yet clinically important because the length and structure of repetitive regions are diagnostic markers associated with several severe human diseases (La Spada & Taylor, 2010; Lopez Castel, Cleary, & Pearson, 2010). The optimal way to study these regions would be to sequence long molecules of native DNA without any prior amplification, because PCR dramatically reduces the chance of successfully reading through regions with extreme guanine-cytosine (GC) content or highly repetitive regions (Eid et al., 2009; Roberts, Carneiro, & Schatz, 2013; Shin et al., 2013) and may also introduce biases and chimeric molecules. In addition, sequencing of native DNA opens up the possibility of directly studying base modifications (Flusberg et al., during library preparation (Antson, Isaksson, Landegren, & Nilsson, 2000; Dahl et al., 2007; Gnirke et al., 2009; Tewhey et al., 2009) and are used in conjunction with short-read sequencing technologies. While this is often a satisfactory approach, repeat elements and regions with extreme GC content (< 25% or > 65%) are major obstacles and have to be taken into consideration in the experimental design (Mertes et al., 2011). Although several targeted enrichment approaches have been adapted for long-read SMRT sequencing, for example, using long-range PCR (Ardui et al., 2017; Lode et al., 2017) or hybridization-based approaches (Wang et al., 2015), most of these methods still include a PCR step in the sample preparation. Recently, a couple of new amplification-free protocols have emerged, which could generate new insights into repetitive regions of the human genome predisposing to severe genetic diseases and eventually lead to novel diagnostic assays (BioRxiv: https://doi.org/10.1101/203919, BioRxiv: https://doi.org/10.1101/110163; Pham et al., 2016).
Trinucleotide repeat disorders, such as Huntington disease (HD) and Fragile X syndrome, are caused by expansion of unstable nucleotide repeats. Trinucleotide repeat expansions account for at least 22 neurological disorders, where the repeat size underlies the broad spectrum of phenotypes observed in these disorders (La Spada & Taylor, 2010; Orr & Zoghbi, 2007). HD is an autosomal dominant progressive neurodegenerative disorder caused by an expansion of a CAG repeat in the huntingtin (HTT) gene (MIM# 613004) on chromosome 4 (Macdonald et al., 1993). Symptoms include chorea, ataxia, and personality disorders. The onset of the disorder is usually in adulthood and a longer repeat expansion generally implies an earlier onset. The number of CAG repeats can be divided into four different size ranges that correlate with disease phenotype. Alleles up to 26 repeats are considered normal, while 27-35 repeats are intermediate alleles with potential to expand into the disease range in the next generation. Alleles with 36-39 CAG repeats are HD-causing alleles with reduced penetrance, meaning that the patient may or may not develop HD, whereas alleles with ≥40 repeats are full-penetrance HD-causing alleles (Losekoot et al., 2013; Palomaki & Richards, 2012; Quarrell et al., 2012).
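The four size ranges above can be written as a small lookup. The following Python sketch is only an illustration of the thresholds stated in the text; the function name and the category labels are hypothetical and are not part of the published analysis code.

```python
def classify_cag_allele(repeat_count: int) -> str:
    """Map an HTT CAG repeat count to the size range described in the text.

    Thresholds follow the ranges quoted above: up to 26 (normal),
    27-35 (intermediate), 36-39 (reduced penetrance), >=40 (full penetrance).
    The label strings are illustrative only.
    """
    if repeat_count < 0:
        raise ValueError("repeat count must be non-negative")
    if repeat_count <= 26:
        return "normal"
    if repeat_count <= 35:
        return "intermediate"
    if repeat_count <= 39:
        return "reduced penetrance"
    return "full penetrance"


if __name__ == "__main__":
    # Boundary values of each range
    for count in (26, 27, 35, 36, 39, 40):
        print(count, classify_cag_allele(count))
```

The boundaries make clear why a sizing error of a single repeat matters: counts of 26 and 27, 35 and 36, or 39 and 40 fall into different categories.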
HD, as well as other trinucleotide repeat disorders, is typically diagnosed using PCR amplification of the repeat element, and the fragment size is determined by capillary electrophoresis. For very large expansions, Southern blotting protocols or triplet repeat primed PCR (TP-PCR) are recommended as complementary technologies (Losekoot et al., 2013). Fragment analysis, as well as Southern blotting and TP-PCR, is dependent upon accurate amplification and fragment sizing and does not analyze the DNA sequence itself. Studies have shown that in clinical HD analysis, 3%-13% of alleles fall outside error limits set by generally adapted best practice guidelines (Losekoot et al., 2013;Quarrell et al., 2012). An exact determination of the repeat count is important for clinical diagnostics of HD patients, in particular for the cases where repeat sizes cross borders of the repeat size ranges correlated with disease phenotypes.
Recently, we developed a protocol for targeted enrichment without amplification for use on PacBio's instruments (BioRxiv: https://doi.org/10.1101/203919). The method (named No-Amp Targeted sequencing) employs the CRISPR/Cas9 system for directed SMRT sequencing of DNA molecules that carry the target of interest. The combination of amplification-free target enrichment with SMRT sequencing provides a powerful tool for studying repetitive sequences and/or regions with extreme GC content. In our previous study, we described the method and how it has been optimized, and presented proof-of-principle data on human cell lines (BioRxiv: https://doi.org/10.1101/203919). In the present work, we apply No-Amp Targeted sequencing to DNA isolated from blood samples from individuals subjected to clinical HD diagnostics, with the aim of studying repeat elements in the HTT gene as well as three other clinically relevant loci: FMR1 (MIM# 309550), ATXN10 (MIM# 611150), and C9orf72 (MIM# 614260). These additional loci harbor repeat expansions causative for Fragile X syndrome (CGG repeat), spinocerebellar ataxia type 10 (SCA10) (ATTCT repeat), and amyotrophic lateral sclerosis (ALS)/frontotemporal dementia (FTD) (GGGGCC repeat). In addition, we developed a robust analysis pipeline that automatically computes the repeat count on both alleles and visualizes the contents of the repeat sequence. Importantly, our analysis does not require an alignment of the sequence reads to a human reference, which is an advantage when examining these types of repeats, which are often of variable and unknown length. Previously, analysis of No-Amp Targeted sequencing data has been alignment based and dependent on a whole panel of reference sequences, containing all possible repeat lengths, to make sure that reads with variable repeat sizes could successfully be aligned (BioRxiv: https://doi.org/10.1101/203919).
Alignment-based approaches may be suitable in many situations, but they are not ideal in cases where it is difficult to make a priori assumptions on the content and structure of the captured sequence. For example, this could involve regions where several different repeats of variable sizes are present, or regions containing unexpected events such as insertions. By applying our analysis tool to amplification-free SMRT sequencing data from clinical HD samples, we obtain detailed sequence information for the HTT region, as well as the other captured loci, from hundreds of individual cells directly from the sequence reads.
This information can be used to accurately analyze repeat elements and to study heterogeneity in repeat size within the cell population.

Fragment analysis of HTT repeat expansions
All samples and controls were subjected to PCR with primers designed to target the CAG repeat in the HTT gene (5′-CGGCGGTGGCGGCTGTTG-3′ and 5′-FAM-CCTTCGAGTCCCTCAAGTCCTTC-3′). The PCR reaction contained 30 ng genomic DNA, 1× PCR reaction buffer (GC-rich PCR system, Sigma-Aldrich), 0.25 mM dNTP, 0.5 µM of primers, and 0.5 U of GC-rich enzyme (Sigma-Aldrich). A 15-minute enzyme activation at 96 °C was followed by 40 cycles of 96 °C for 15 seconds, 60 °C for 30 seconds, and 72 °C for 30 seconds, and a final extension at 60 °C for 45 seconds. Diluted PCR products were combined with HiDi formamide and ROX500 (Thermofisher) prior to denaturation at 95 °C for 5 minutes. The PCR products were separated on a 3500xl Genetic Analyzer (Thermofisher), and the software GeneMarker (SoftGenetics) was used for size determination. The precision of the assay has been determined by the laboratory to ±1 repeat for alleles > 50.

Design of guide RNAs
Guide RNAs (gRNAs) were designed using the human genome ref- New England Biolabs), at 37 °C for 3 hours, in the presence of calf intestinal alkaline phosphatase (New England Biolabs) for genome complexity reduction (see Supporting Information Table S1 for detailed information). The restriction enzymes were predicted not to cut within our target designs. Samples were fragmented with BamHI-HF and EcoRI-HF (New England Biolabs) at 37 °C for 3 hours followed by 20 minutes at 65 °C for enzyme inactivation. Subsequently, restriction-site-specific hairpin adapters (5′-GATCATCTCTCTCTTTTCCTCCTCCTCCGTTGTTGTTGTTGAGAGAGAT-3′ and 5′-AATTATCTCTCTCTTTTCCTCCTCCTCCGTTGTTGTTGTTGAGAGAGAT-3′) were ligated to the fragments to form SMRTbell libraries using E. coli DNA ligase (New England Biolabs). The adapter ligation was performed overnight at 16 °C followed by 20 minutes incubation at 65 °C for enzyme inactivation.

Library preparation and PacBio sequencing
The crRNA and tracrRNA with Alt-R modification (Integrated DNA Technologies) were annealed in a 1:1 ratio to form gRNA that was used in the Cas9 (New England Biolabs) digestion of the SMRTbell libraries. Cas9 and gRNA in the presence of buffer were incubated at 37 °C for 10 minutes, before heparin was added, and the mixture was incubated for an additional 3 minutes at 37 °C. SMRTbell library was then added and incubated for 1 hour at 37 °C (see Supporting Information Table S1 for sample-gRNA combinations).
EDTA was then added to terminate the reaction and the SMRT- The asymmetric SMRTbell molecules were prepared for SMRT sequencing by primer annealing with standard PacBio sequencing primer lacking the polyA sequence for 1 hour at room temperature followed by AMPure PB bead (Pacific Biosciences) purification to remove excess primer. P6 polymerase was bound to the SMRTbell template/primer complex in the presence of free SMRTbell hairpin adapters to bind excess polymerase. The entire sample of enriched asymmetric SMRTbell molecules went into the primer annealing, because the amount of library could not be quantified at this point. Sequencing was performed on the PacBio RS II system using a modified MagBead One Cell Per Well protocol, C4 chemistry, and 360 minutes movie time.

Primary analysis and alignment of PacBio reads
Asymmetric SMRTbell template sequencing data were subjected to a customized analysis pipeline for polyA and conventional hairpin adapter recognition for separating subreads. The Reads of Insert tool in SMRT Portal (Pacific Biosciences) was used to create Circular Consensus Sequencing (CCS) reads from the subreads. Blasr (https://github.com/PacificBiosciences/blasr) was used to map the CCS reads to the human genome GRCh38. Mapping results were plotted in a histogram to visualize on-target and off-target effects.

Analysis of HTT and other repeats in PacBio data
It has previously been shown that trinucleotide repeats of at least 750 units can be accurately determined by SMRT sequencing and generation of CCS reads (Loomis et al., 2013). Because the CAG repeat in HTT is usually shorter than 100 units also for expanded alleles, we opted to use CCS reads instead of subreads as the basis for our analysis. To analyze the sequences in HTT and other repeat expansion targets, the CCS reads were used as input to a custom R script that identifies the on-target reads, extracts the repeat element, counts the number of repeat units for each of the alleles, and visualizes the results. The program also performs an optional error correction that removes single-base insertions or deletions within the repeats. The outline of the analysis is shown in Figure 2A. The code is available from GitHub (https://github.com/NationalGenomicsInfrastructure/HTT-repeat-analysis) along with CCS read data that can be used to execute the program.
FIGURE 1 Target design for the HTT repeat locus. BamHI is used in the fragmentation step in the library protocol, and the BamHI restriction site (shown in green) determines the start of the target design. A gRNA was designed downstream of the CAG repeat (shown in orange) and CCG repeat (shown in purple) in the HTT gene, and the Cas9 digestion site within the gRNA design is shown in red. The complete capture design is shown within the boundaries of the gray box. The lengths between the CAG repeat and the BamHI restriction site (l1) and the CCG and the Cas9 digestion site (l2) are used in downstream analysis of the repeat sizes
FIGURE 2 Overview of the data analysis method and visualization of results. (A) A schematic view of the HTT locus is shown at the top, followed by a step-by-step description of the analysis below. In the first step, CCS reads are generated and the figure shows a read containing the HTT target where the CAG repeat is represented by an orange color and the CCG repeat by a purple color, two recognition sites of length 14 bp (CCCTCAAGTCCTTC and CCTCCTCAGCTTCC) flanking the repeat are shown in black, and remaining parts of the reads upstream and downstream of the repeat are shown in gray. In step 2, reads matching the HTT target are identified by the recognition sites (allowing for two indel mismatches), and step 3 further requires the observed length of the upstream and downstream parts of the reads to agree with the expected lengths (l1 and l2). In step 4, the reads are trimmed so the entire repeat sequence is extracted from the read. Finally, step 5 is an optional error correction that removes indel errors within the CAG repeat sequence. (B) The histogram shows the distribution of CAG repeats detected in the on-target reads for sample 10, with the two peaks at 21× and 29× representing the CAG repeat counts on the two alleles for this heterozygous individual. There is a distribution of reads having other repeat counts, and these can be explained either by somatic variation in the sample or by sequencing errors. The panel on the right shows a repeat-content plot for the same sample. Each horizontal line corresponds to a CCS read, where CAG trinucleotides are shown in red and CCG in blue. The gray dots in the red and blue fields represent positions that contain sequences that are different from CAG and CCG. (C) Data from the same sample as in (B), but after indel error correction in the repeat sequences. The error correction results in a histogram with more distinct peaks at 21× and 29×, and a repeat-content plot with fewer gray interruptions

Experimental setup
The No-Amp Targeted sequencing approach is an amplification-free target enrichment method that utilizes the CRISPR/Cas9 system, where the Cas9 functions as a directed endonuclease, coupled with SMRT sequencing. The method has previously been described in detail

Analysis strategy
Although several computational tools can be used to study repeat elements in next-generation sequencing data (Dolzhenko et al., 2017; Liu, Zhang, Wang, Gu, & Wang, 2017; Tang et al., 2017), none of these has been specifically designed for the No-Amp Targeted sequencing protocol. We therefore decided to implement our own strategy, which is outlined in Figure 2A. Our aim was to create an automated analysis method that would first identify all the on-target reads, then extract and count the repeated units, and finally visualize the results. Importantly, we wanted all analysis steps to be performed without any alignment of reads to a reference sequence. To extract on-target reads, we searched for specific sequences of length 14 bp flanking the start and end of the repeat unit within all CCS reads produced for a specific sample. Only reads containing both the repeat start and end elements were kept. Moreover, the lengths of the sequences upstream and downstream of the repeat element were allowed to differ at most 10% from the expected lengths from the target design, which are indicated by l1 and l2 in Figure 1. By these criteria we could very specifically extract all reads containing an expected repeat target without aligning the reads to a reference. In a subsequent step, the repeat sequences were extracted from the on-target reads and the repeat units were counted.
The results were then visualized both as a histogram and as a colored image showing the repeat structure in each on-target read. Example results for the HTT repeat analysis are shown in Figure 2B.
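The read selection and counting steps can be illustrated with a short Python sketch. This is not the published R script: it is a simplified illustration that assumes exact matching of the two 14-bp recognition sites (the actual method allows two indel mismatches), reads in a single orientation, and a repeat count defined as the longest uninterrupted run of the unit. All function names are hypothetical, and the expected lengths passed in the example are made-up placeholders rather than the real l1 and l2 values.

```python
import re
from typing import Optional

# Recognition sites quoted in the Figure 2 legend
LEFT_SITE = "CCCTCAAGTCCTTC"
RIGHT_SITE = "CCTCCTCAGCTTCC"


def within_tolerance(observed: int, expected: int, tolerance: float = 0.10) -> bool:
    """True if observed differs from expected by at most `tolerance` (fraction)."""
    return abs(observed - expected) <= tolerance * expected


def extract_repeat_region(read: str, expected_upstream: int,
                          expected_downstream: int) -> Optional[str]:
    """Return the sequence between the two sites, or None for off-target reads.

    Assumption: upstream/downstream lengths are measured from the read ends
    to the repeat region and therefore include the recognition sites.
    """
    left = read.find(LEFT_SITE)
    if left == -1:
        return None
    start = left + len(LEFT_SITE)
    end = read.find(RIGHT_SITE, start)
    if end == -1:
        return None
    if not within_tolerance(start, expected_upstream):
        return None
    if not within_tolerance(len(read) - end, expected_downstream):
        return None
    return read[start:end]


def count_longest_run(sequence: str, unit: str) -> int:
    """Number of units in the longest uninterrupted run of `unit`."""
    runs = re.findall(f"(?:{re.escape(unit)})+", sequence)
    return max((len(run) // len(unit) for run in runs), default=0)


if __name__ == "__main__":
    # Synthetic read: placeholder flanks around 21 CAG and 7 CCG units
    repeat = "CAG" * 21 + "CAACAG" + "CCGCCA" + "CCG" * 7
    read = "A" * 86 + LEFT_SITE + repeat + RIGHT_SITE + "T" * 36
    region = extract_repeat_region(read, expected_upstream=100, expected_downstream=50)
    print(count_longest_run(region, "CAG"), count_longest_run(region, "CCG"))
```

Because the selection relies only on the recognition sites and the flank lengths, a read with an unexpected repeat length or an insertion within the repeat is still retained, which is the main motivation for avoiding alignment.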
As seen in Figure 2B, some errors were present in the data, mainly introduced by single-base insertions/deletions in the CCS reads. We therefore developed a method that allows us to correct for indel errors within the repeat units, which is the most common type of error in SMRT sequencing data (Eid et al., 2009). The principle behind the error correction is that a repeat unit containing one indel, which is flanked at both sides with at least two correct repeat units, can be corrected.
For example, when studying the CAG repeat in HTT, the sequence CAGCAGCGCAGCAG would be corrected to CAGCAGCAGCAGCAG.
Our results suggest that this error correction removes most of the indel errors and generates more accurate estimates of the repeat counts (see Figure 2C). However, the error correction should be seen as an optional step, as it is not always advisable to modify the original reads.
FIGURE 3 Genome-wide coverage plots. Genome-wide coverage plots for replicates 1 and 2, prepared from HEK 293 cell line DNA, are shown. The y-axis shows the number of reads and the x-axis spans over all the chromosomes in the human genome. The color of the peaks shows which gRNA the peak correlates with, green for ATXN10, blue for FMR1, orange for HTT, and red for C9orf72. In addition to the on-target peaks for each of the gRNA, off-target peaks are observed for the HTT and the ATXN10 gRNAs
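The correction principle, one indel-containing unit flanked on both sides by at least two correct units, can be sketched with a regular expression. The Python code below is a hypothetical illustration of that rule, not the implementation in the published R script; the function names are invented for this example, and only single-base insertions or deletions within one unit are handled.

```python
import re


def one_indel_variants(unit: str) -> set:
    """All sequences obtained from `unit` by one single-base insertion or deletion."""
    deletions = {unit[:i] + unit[i + 1:] for i in range(len(unit))}
    insertions = {unit[:i] + base + unit[i:]
                  for i in range(len(unit) + 1) for base in "ACGT"}
    # An insertion of the same base can reproduce nothing shorter than len+1,
    # so the unit itself is never part of this set.
    return deletions | insertions


def correct_indels(sequence: str, unit: str = "CAG") -> str:
    """Replace a unit with one indel by the intact unit when it is flanked
    on each side by two correct units (checked on the uncorrected sequence)."""
    flank = re.escape(unit * 2)
    variants = "|".join(sorted(one_indel_variants(unit), key=len, reverse=True))
    pattern = re.compile(f"(?<={flank})(?:{variants})(?={flank})")
    return pattern.sub(unit, sequence)


if __name__ == "__main__":
    # Example from the text: a deletion in the third CAG unit
    print(correct_indels("CAGCAGCGCAGCAG"))
```

Requiring two intact units on each side keeps the correction conservative: genuine interruptions such as the CAACAG motif downstream of the HTT CAG repeat are not single-indel variants of CAG and are left unchanged.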

Performance of No-Amp Targeted sequencing
To evaluate the performance and reproducibility of the No-Amp The coverage plots in Figure 3 show peaks representing sites that were not intended to be targeted by our assay. These off-target sites appear to be consistent over the entire set of replicates (Supporting Information Figure S2). Off-target effects are a known consequence of the CRISPR/Cas9 system caused by locations in the genome with sufficient similarity to the gRNA target sequence to induce Cas9 activity (Fu et al., 2013). The most striking off-target effect was found on chromosome 5, and further investigation of this site showed high homology between the HTT gRNA and an intronic region of the GALNT10 gene. The sequence at this site shows a 3-bp mismatch to the HTT gRNA (Supporting Information Figure S3A). Additional off-target sites were detected on chromosomes 4 and 9. These off-target effects were

Enrichment results for clinical HD samples
We further applied No-Amp Targeted sequencing to 11 clinical HD samples, resulting in an enrichment profile over the entire genome, similar to what was obtained for the HEK 293 replicates (Supporting Information Figure S4).  Figure S3B). The off-target effect found on chromosome 4 in HEK 293 was not observed in any of the clinical samples, and there is no known SNP variation that explains this variability in off-target effect. However, it is likely that the HEK 293 cell line carries a mutation in this region that increases the homology to the gRNA design.

Variation in HTT CAG and CCG repeat size in clinical HD samples
The most prevalent CAG repeat size for every allele in the HD samples, according to our analysis, agreed with previous data from fragment analysis (Table 1). Figure 4A shows the CAG repeat distribution for two HD samples, and corresponding results for the remaining samples are shown in Supporting Information Figure S5. Interestingly, alleles with fewer repeats (e.g., sample 1) showed less repeat size variation.
Conversely, alleles with large repeat sizes had a wider distribution of CAG repeats. One example of this is sample 11, which has an expanded allele ranging from 53 to 57 CAG repeats, with the highest peak at 54 (Figure 4A). A similar distribution can also be seen in fragment analysis data for sample 11, but with a very weak signal for the expanded allele (Supporting Information Figure S6). This variability indicates a somatic mosaicism of HTT repeat sizes, which is a known molecular event both within and in between tissues in HD patients (De Rooij, De Koning Gans, Roos, Van Ommen, & Den Dunnen, 1995; Telenius et al., 1994) and is known to be more pronounced for larger repeat sizes (Telenius et al., 1994; Veitch et al., 2007).
In addition to resolving CAG repeat sizes, we analyzed the polymorphic CCG repeat that flanks the CAG repeat. In our 11 samples, the most common CCG allele both among non-HD-causing and HD-causing alleles is 7 CCG repeats, and the next most common is 10 CCG repeats. Although the CCG repeat is polymorphic, no correlation between CCG repeat size and onset of HD has been found (Andrew, Goldberg, Theilmann, Zeisler, & Hayden, 1994). Among our sample set we found three different alleles, containing 7, 9, and 10 CCG repeats, respectively (Table 1; Supporting Information Figure S5). Fifty-five percent (6/11) of the individuals were homozygous for either 7 or 10 repeats, and 36% (4/11) were heterozygous with 7 and 10 repeats. One individual was heterozygous with 9 and 10 repeats. The most common allele in our data is 7 CCG repeats (63%), the second most frequent is 10 CCG repeats (32%), and the 9 CCG repeat allele was the least common (5%). This distribution is in good agreement with previous studies (Agostinho Lde et al., 2012; Andrew et al., 1994). Figure 4B shows two examples of different combinations of normal and extended CAG alleles and homozygous and heterozygous CCG alleles. No apparent correlation between pathogenic CAG expansions and CCG repeat count or heterozygosity was detected. However, for all heterozygous individuals, the longer CCG repeat was flanking the shorter CAG repeat.
FIGURE 5 Detection of interruptions in the FMR1 CGG repeat. Repeat-content plots for the FMR1 repeat sequence in two of the individuals. Sample 6 (to the left) is heterozygous, with 28 and 22 CGG repeats on the two alleles. The allele with 28 CGG repeats is interrupted by two AGG repeats (shown in blue), whereas the allele with 22 CGG repeats only contains one single AGG interruption. Sample 10 (to the right) is also heterozygous, with 35 and 28 CGG repeats on the different alleles. For this sample, the longer allele (35 × CGG) contains one AGG interruption, whereas the shorter allele (28 × CGG) contains two AGG interruptions

Analysis of ATXN10, FMR1, and C9orf72 repeats
Even though our samples were selected for screening of the HTT repeat region, the multiplexing in our experiment also allowed us to analyze the captured sequences in ATXN10, FMR1, and C9orf72 (see Supporting Information Table S3). As expected, no unusual repeat expansions were found. However, for individual 4, one of the alleles in C9orf72 contains 15 GGGGCC repeats. This is still within the range that is generally considered normal (< 25 GGGGCC) (Cruts, Engelborghs, van der Zee, & Van Broeckhoven, 1993), but is a considerably larger repeat compared to the other GGGGCC repeats in our data set (< 8 GGGGCC). In addition to counting the number of repeats on each allele, our analysis method makes it easy to determine the presence and exact location of repeat interruptions within the FMR1 molecules (see Figure 5). Information about repeat interruptions may in some cases have a direct clinical diagnostic value.
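Locating interruptions in an extracted repeat sequence amounts to scanning the sequence unit by unit. The Python sketch below is a hypothetical illustration, not the published code: it assumes the extracted sequence starts in frame with the repeat unit and is free of indels, and the example sequence is synthetic rather than taken from any of the samples.

```python
from typing import List, Tuple


def find_interruptions(repeat_seq: str, unit: str = "CGG") -> List[Tuple[int, str]]:
    """Return (1-based unit position, observed unit) for every unit that
    differs from the expected repeat unit.

    Assumes the sequence is in frame with `unit`; trailing bases that do
    not fill a complete unit are ignored.
    """
    k = len(unit)
    units = [repeat_seq[i:i + k] for i in range(0, len(repeat_seq) - k + 1, k)]
    return [(idx + 1, observed) for idx, observed in enumerate(units)
            if observed != unit]


if __name__ == "__main__":
    # Synthetic allele: CGG units with AGG interruptions at units 10 and 20
    allele = "CGG" * 9 + "AGG" + "CGG" * 9 + "AGG" + "CGG" * 8
    print(find_interruptions(allele))
```

Reporting positions per read makes it possible to check that an interruption is seen consistently across the reads of one allele, rather than arising from a sequencing error in a single read.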

DISCUSSION
We have evaluated an amplification-free targeted enrichment method for studying repeat expansions in clinically relevant samples. With the No-Amp Targeted sequencing approach, we can obtain sequence information about the CAG and CCG repeats in the HTT gene without the concern of introducing bias by PCR. (Heigwer, Kerr, & Boutros, 2014; Naito, Hino, Bono, & Ui-Tei, 2015; Perez et al., 2017) could result in a more sensitive assay. Also, it would be interesting to evaluate high-fidelity Cas9 enzymes that have proven to decrease off-target effects while retaining on-target activity (Kleinstiver et al., 2016). A more specific and sensitive enrichment assay would lead to reduced requirements on DNA input amounts. At present, No-Amp Targeted sequencing requires at least 5 µg of input DNA, and this limits the use of the method to specific sample types, such as blood, where it is easy to obtain large amounts of DNA.
Variability in the number of reads on-target was observed both between the HEK 293 replicates and in the patient samples (see Supporting Information Tables S2 and S3). The number of reads on-target was generally lower for the blood samples than for the HEK 293 replicates. This can partly be explained by the fact that the blood samples were treated with fewer restriction enzymes for complexity reduction, as well as by differences in DNA input amount (see Materials and Methods). However, variability in the number of on-target reads was also observed in cases where RE treatment was the same and when there were no large differences in DNA input. At present, we can only speculate about the reasons for this variation, but we believe that it is likely due to a combination of factors including sample quality and complexity, enzyme and gRNA stability, and sequencing-related variabilities.
We have shown that multiplexing of targets is possible using the  (Ardui et al., 2017). Our results reveal a mosaic pattern of repeat sizes for larger repeat expansions in HTT, and because this observation is based on analysis of unamplified DNA molecules, this is likely to reflect somatic variation of repeat sizes in the original DNA samples (Telenius et al., 1994;Veitch et al., 2007). The only alternative explanation would be that the additional CAG triplets are being introduced during the PacBio sequencing or during sequence analysis. Both of these explanations are highly unlikely, especially because each molecule is independently sequenced several times to create a consensus (CCS) read from several independent subreads of the molecule. In order for errors to propagate into the CCS results, the exact same erroneous CAG triplets would have to be observed in a majority of the independently sequenced subreads. We use CCS reads as the source of input to our algorithm because the CCS approach is capable of generating unbiased and highly accurate sequencing reads for repeats in our size range (Loomis et al., 2013). However, there are also some drawbacks with using CCS. Most importantly, it is difficult to study large repeat expansion molecules that are too long to generate several subreads which can be combined into a single CCS read. Thus, the CCS approach should not be attempted when expanded alleles are suspected to be of 10 kb length or more because there will be a risk of allelic dropout.
For these cases, it might be necessary to use alternative methods, such as the tool recently proposed by Liu et al. (2017), which has the advantage that it can work on PacBio subreads.
Fragment analysis is a routine diagnostic genetic test for HD.
Failure of amplification of large expanded alleles can lead to allelic dropout and misinterpretation of the genotype as homozygous for a normal allele (Losekoot et al., 2013;Palomaki & Richards, 2012;Potter, Spector, & Prior, 2004). Polymorphisms in primer sites could be another reason for misinterpretation of diagnostic tests using fragment analysis (Holzmann, Saecker, Epplen, & Riess, 1997;Losekoot et al., 2013;Potter et al., 2004). Heterozygosity in the flanking CCG repeat may also contribute to incorrect calling of CAG repeat sizes (Losekoot et al., 2013) if the amplicons used for sizing the CAG repeats include the CCG repeat. Southern blotting or TP-PCR is usually used as a complement in cases where fragment analysis indicates that the individual is homozygous (Losekoot et al., 2013;Potter et al., 2004).
Although our targeted enrichment method avoids biases related to PCR amplification of these complex repeats, it is still vulnerable to polymorphisms in the guide RNA sequence or restriction sites. This study represents the first time No-Amp Targeted sequencing is used for HTT diagnostics, but it is not yet ready to be implemented in clinical routine. For our method to replace fragment analysis, there is a need to reduce the current variation in on-target read number, simplify the laboratory protocol, reduce the requirements on amount of input DNA, and lower the cost. With these improvements, our method could become a powerful tool for understanding the nucleotide repeat disorders in a clinical routine setting.
Somatic variation of HTT repeat expansions is a known phenomenon and has been studied previously, with the largest variability observed in the regions of the brain that have most neuropathological involvement in HD (Aronin et al., 1995). As we have also seen in our results, larger repeat sizes show greater somatic variability, which is consistent with previous reports that larger repeat sizes have an earlier onset of mutation instability (Kennedy et al., 2003). Correlation between the magnitude of repeat expansion size and age of disease onset has been observed, where the most prominent somatic size mosaicism has been seen in juvenile onset of HD (Kahlem & Djian, 2000; Swami et al., 2009). The common analysis method for exploring somatic variability of CAG repeats is small-pool PCR (SP-PCR) (Gomes-Pereira, Bidichandani, & Monckton, 2004) or single molecule PCR (Veitch et al., 2007), which depends on single molecule nested PCRs and detection by Southern blotting (SP-PCR) or fragment analysis (single molecule PCR). These methods are extremely labor-intensive, because numerous parallel PCR reactions have to be performed for each sample. Interpretation of results is affected by PCR stutter, and the true size variability may be hard to determine (Veitch et al., 2007). Because No-Amp Targeted sequencing does not rely on amplification, we believe that our results provide a more accurate representation of the somatic variation compared to methods relying on bulk-PCR. We also believe that the No-Amp method simplifies experimental and analytical procedures in studies on somatic mosaicism of unstable repeat expansions. No-Amp Targeted sequencing also has the potential to contribute to other aspects of repeat expansion studies.
A unique advantage of SMRT sequencing is the ability to directly study base modifications, such as DNA methylation, which have been shown to influence the phenotype of Fragile X (Usdin et al., 2014) and might also be relevant in other repeat expansion disorders.
In conclusion, we have successfully applied a novel amplification-free targeted enrichment method to study the trinucleotide repeat in HTT in clinical HD samples, as well as the three additional loci ATXN10, FMR1, and C9orf72. Our mapping-independent software allowed us to confidently analyze the unstable HTT repeat and to study repeat sequence variations such as interruptions in the FMR1 repeat. The PCR-free methodology makes it possible to study somatic and allelic variation without any influence of PCR stutter or other amplification-related biases.

AVAILABILITY
The data analysis code, CCS data, and user instructions are available from the following URL: https://github.com/NationalGenomicsInfrastructure/HTT-repeat-analysis. For the 11 patient samples, only CCS reads corresponding to the four target sites (HTT, FMR1, ATXN10, and C9orf72) are available from GitHub.

ACKNOWLEDGMENTS
We sincerely thank the patients who participated in this study.

DISCLOSURE STATEMENT
YT, TAC, and PK are full-time employees at Pacific Biosciences.