A massive parallel sequencing workflow for diagnostic genetic testing of mismatch repair genes

The purpose of this study was to develop a massive parallel sequencing (MPS) workflow for diagnostic analysis of mismatch repair (MMR) genes using the GS Junior system (Roche). A pathogenic variant in one of four MMR genes, (MLH1, PMS2, MSH6, and MSH2), is the cause of Lynch Syndrome (LS), which mainly predispose to colorectal cancer. We used an amplicon-based sequencing method allowing specific and preferential amplification of the MMR genes including PMS2, of which several pseudogenes exist. The amplicons were pooled at different ratios to obtain coverage uniformity and maximize the throughput of a single-GS Junior run. In total, 60 previously identified and distinct variants (substitutions and indels), were sequenced by MPS and successfully detected. The heterozygote detection range was from 19% to 63% and dependent on sequence context and coverage. We were able to distinguish between false-positive and true-positive calls in homopolymeric regions by cross-sample comparison and evaluation of flow signal distributions. In addition, we filtered variants according to a predefined status, which facilitated variant annotation. Our study shows that implementation of MPS in routine diagnostics of LS can accelerate sample throughput and reduce costs without compromising sensitivity, compared to Sanger sequencing.


Introduction
Sanger sequencing (Sanger et al. 1977) has been the gold standard for DNA sequencing during the past decades. However, since the commercial launch of the first massive parallel sequencing (MPS) platform, the Genome Sequencer FLX System from 454 Life Sciences (Roche) in 2005, the sequencing technology has undergone a rapid development (Margulies et al. 2005). MPS has especially been embraced by genomic research because it facilitates performance of complex genetic studies that were not technically or economically feasible with Sanger sequencing.
Use of MPS in genetic diagnostics was limited in its initial phase, mostly due to the costs and capacity of the first MPS platforms. Genome and exome sequencing are examples of applications that these platforms are designed for, while for clinical applications sequencing of subsets of genes are of most interest. In 2010, Roche introduced the GS Junior, which is a small benchtop sequencing platform more compatible with the needs of a diagnostic laboratory. This system produces 100,000 shotgun and 70,000 amplicon high-quality, filtered reads, each 10-h run. Average read length is 400 bp. The performance of this system should be comparable to the well-documented GS FLX System (Rothberg and Leamon 2008;Liu et al. 2012) because they both use 454 sequencing chemistry, which combines emulsion polymerase chain reaction (PCR) of single-stranded DNA molecules and massive parallel pyrosequencing.
Usually, the DNA variant detection strategy in a diagnostic laboratory consists of a screening method like high-resolution melting (HRM) analysis followed by confirmation of the detected variant by Sanger sequencing. The large capacity of MPS platforms and the possibility to multiplex samples can improve screening efficiency. Several papers have presented MPS protocols for diagnostic purposes, the vast majority for analysis of BRCA1 and BRCA2 (Morgan et al. 2010;Walsh et al. 2010;De Leeneer et al. 2011b;Feliubadalo et al. 2012;Hernan et al. 2012;Michils et al. 2012) but also Neurofibromatosis type 1 (Chou et al. 2010). Different strategies have been utilized, like sequence capture with subsequent MPS (Chou et al. 2010;Walsh et al. 2010) and commercial or in-house amplicon-based sequencing methods (Morgan et al. 2010;De Leeneer et al. 2011b;Feliubadalo et al. 2012;Hernan et al. 2012;Michils et al. 2012). For this study, we selected the DNA mismatch repair (MMR) genes MSH2 (MIM #609309), MLH1 (MIM #120436), MSH6 (MIM #600678), and PMS2 (MIM #600259) to optimize a Massive parallel amplicon-sequencing analysis. Mutations in any of these genes are the causes of most Lynch Syndrome (LS) cases (Peltomaki 2005). LS, also known as Hereditary Nonpolyposis Colorectal Cancer (HNPCC), predispose to colorectal cancer (CRC) and are autosomal dominantly inherited. It is the most common hereditary CRC syndrome and account for 3-4% of all CRCs (Hampel et al. 2008). Extracolonic cancers in the endometrium, ovary, stomach, hepatobiliary tract, upper urinary tract, small bowel, pancreas, and brain are also associated with LS (reviewed in Bozzao et al. 2011).
PMS2 analysis is often neglected from genetic testing of LS due to the presence of multiple pseudogenes. Strong homology between several pseudogenes and gene sequence introduce difficulties for reliable variant detection (Nicolaides et al. 1995;De Vos et al. 2004;Nakagawa et al. 2004). There are 15 different PMS2 pseudogene loci identified in the human genome. The majority share homology with the 5′ end of PMS2 containing exon 1-5 while the pseudogene PMS2CL share homology with exons 9 and 11-15, where exon 12 and 15 are identical to PMS2 (De Vos et al. 2004;Nakagawa et al. 2004). In addition, sequence exchange between the 3′ region of PMS2 and the pseudogene PMS2CL may cause detection of pseudogene sequence (Hayward et al. 2007). To our knowledge, we are the first to present a MPS workflow for diagnostic analysis of the MMR genes associated with LS, including PMS2. Careful primer design that utilizes differences in the PMS2-gene and pseudogenes was essential to avoid coamplification and subsequent sequencing of pseudogenes.
We optimized an in-house amplicon-sequencing approach analyzed on the GS Junior system. This study provides an example of how this can be performed starting with amplicons originally designed for Sanger sequencing and using standard data analysis software, accompanying the sequencing system. The strategies for achieving coverage uniformity and dealing with under or overcalling of homopolymer (HP) stretches are different from previously reported. The MPS workflow is compared to Sanger sequencing with regard to, workload, sample turnaround time, specificity, and sensitivity.

DNA samples
In total, 55 DNA samples isolated from ethylenediaminetetraacetic acid preserved whole blood with iPrep Pure link gDNA Blood kit were included in the study. All patient samples were obtained with written informed consent.
The capability of the GS Junior system to detect insertion and deletion variants was evaluated using 23 samples previously characterized by Sanger sequencing in a diagnostic setting. The previously identified variants were 14 deletions, seven duplications, and two indels. These samples were sequenced only for the amplicon containing the variant (coverage results not shown).
To setup and optimize our MMR gene MPS workflow, 32 samples included in a local CRC biobank described elsewhere (Trano et al. 2010) were sequenced for MLH1, PMS2, MSH6, and MSH2. For 16 of these samples, one or several MMR genes had previously been Sanger sequenced (six samples for all four MMR genes, one sample for MLH1, MSH2, and PMS2, three samples for MLH1 and PMS2, three samples for MSH2 and MSH6 and one sample was only sequenced for PMS2). Only the previously identified variants were used for evaluation of MPS sensitivity. The remaining 16 samples, previously uncharacterized, were used to test the workflow in a diagnostic setting.

Library preparation
We used an in-house amplicon-sequencing procedure similar to De Leeneer et al. (2011b). Only minor modifications to our original Sanger amplicon design were necessary to make it suitable for MPS. The workflow consisted of two parallel parts, one MPS part and one Sanger sequencing part (Fig. 1). The MPS procedure involved two rounds of PCR. First, a target-specific singleplex PCR, using fusion primers with universal tail sequences. Amplicons from the singleplex PCRs were pooled before the second multiplex PCR was carried out, using MID (Multiplex Identifier)-barcoded primers targeting the universal tails in the first PCR round. MID-barcodes are used by the GS Junior system software to identify the individual patient samples. All PCRs were performed on the Techne TC-512 thermal cycler.

Singleplex universal-tailed PCR
For each sample, 88 fragments were amplified to cover the complete coding and splice site regions of MSH2, MSH6, MLH1, and PMS2. The amplicon lengths ranged from 283 to 489 bp. PCR setup was performed by Hamilton STARlet liquid handling workstation. The total volume for this PCR was 25 lL and included 29 SensiMix TM HRM (Bioline, London, U.K.), 1.4 mmol/L MgCl 2 , 0.28 lmol/L of each primer (forward and reverse), and 30 ng DNA. Three different touchdown PCR programs were used. The thermal cycling conditions were: denaturation at 95°C for 10 min, then 16/8 cycles of denaturation at 95°C for 30 sec, annealing at 61/64/66°C (decreasing 0.5°C for each cycle), and extension at 72°C. Then, 24/32 additional cycles followed consisting of denaturation at 95°C for 30 sec, annealing at 53/56/62°C for 30 sec, and extension at 72°C. Final extension was carried out at 72°C for 5 min.

MID-barcoded multiplex PCR
After the first PCRs (singleplex universal-tailed PCR), 78 (10 were Sanger sequenced) singleplex amplicons were pooled to eight amplicon pools for each sample. To ensure uniform coverage distribution, the different singleplex amplicons were pooled at different ratios with volumes varying from 1 lL to 20 lL depending on the coverage obtained in the previous sequencing analysis. Composition of the different amplicon pools are shown in Table A1. Each pool was diluted 100 times and 1 lL was used as template for the multiplex PCR with MIDbarcoded primers. The amplification mixture included 59 AccuPrime TM GC-Rich Buffer A (Life Technologies, Paisley, U.K.), 1 U Accuprime TM GC-Rich DNA Polymerase (Life Technologies), and 0.28 lmol/L of each MID-barcoded primer (forward and reverse). Total volume for this PCR was 25 lL. The PCR program consisted of a 5 min denaturation step at 95°C, 20 cycles of denaturation at 95°C for 30 sec, annealing at 58°C for 30 sec, extension at 72°C for 1 min, and final extension at 72°C for 5 min.
The eight multiplex PCR products were pooled to one total sample pool that consisted of all 78 amplicons from one sample. The pools were purified with Agencourt â AMPure â XP (Beckman Coulter, High Wycombe, U.K.) according to the library purification procedure described by Roche in the Amplicon Library Preparation Method Manual. Bioanalyzer 2100 (Agilent, Santa Clara, CA) was used to evaluate the fragment lengths in the amplicon pools. DNA concentration of the pools was measured with Nanodrop 8000 Spectrophotometer (Thermo Scientific, Wilmington, DE) and the eight total sample pools were equimolarly pooled again into one mix containing all amplicons from all samples to be sequenced in a single-GS Junior run.

Massive parallel sequencing
The emulsion PCR protocol (Lib-A) recommends an input of two molecules of library DNA per capture bead. Based on our own (unpubl. data) and others (Zheng et al. 2010;Jiang et al. 2012) experimental experience, a lower molecule-to-bead ratio gives more desirable sequencing results because the amount of nonreadable beads with more than one DNA molecule is reduced. We used a molecule per bead ratio of 0.5 and obtained an enrichment percentage of~5% (500,000 beads containing DNA). Loading only 500,000 beads (as opposed to 2,000,000 beads, which is the upper limit) onto the Pico Titer Plate (PTP), reduce light signal interference from neighboring wells and further increase sequencing quality. Sequencing on the GS Junior was performed according to manufacturer's instructions.

Data analysis
The reads from each GS Junior run were analyzed using GS Amplicon Variant Analyzer v2.7 (AVA) software (Roche, Basel, Switzerland). All reads that passed the AVA default filters were aligned to PMS2 (NG_008466.1), MSH6 (NG_007111.1), MLH1 (NG_007109.1), and MSH2 (NG_007110.1). We applied some additional filters, which we considered useful to reduce the number of false positives (FP) without risk of losing any true positives (TP). Only variants present in both forward and reverse reads, with a combined variant frequency (VF) of at least 15% were further considered. Theoretically, this filter setting requires a minimum coverage of 18 to detect a heterozygote variant with a probability of 99.9% (Phred score of 30) (De Leeneer et al. 2011a). However, as this theoretical value only takes sampling effects into consideration, we found it to be too low. Our experience (see below) is that sequence context also affects allele frequencies and we therefore elevated the coverage threshold to 38. This threshold has also been used in other studies (De Leeneer et al. 2011b;Feliubadalo et al. 2012). Under and overcalling of HP regions is a well-known problem with pyrosequencing (Huse et al. 2007). To separate TP from FP calls in HP regions, we did a cross-sample comparison and evaluation of flow signal distributions (approach recommended by Roche). If the variant was present at similar frequency in forward and reverse directions across all samples, the variant was considered to be a FP. A signal distribution is a histogram of all the flow signals of forward and reverse reads that align to a specific position. When viewing flow signal distributions, the TP variants are expected to give dual peaks, while single peaks are expected in case of FP variants (Fig. 2). Note that evaluation of distribution signals requires that the variant is called with sufficient reads in the forward and reverse direction. Identified TP variants were annotated according to Human Genome Variation Society guidelines using transcript references NM_000251.1 (MSH2), NM_000249.3 (MLH1) NM_000179.2 (MSH6), and NM_000535.5 (PMS2), and classified using a five class system (1 = neutral, 2 = likely neutral, 3 = uncertain, 4 = likely pathogenic, and 5 = pathogenic) (Plon et al. 2008;Spurdle 2010). We further utilized the "variant status" filter in AVA that allows users to filter the variants based on their predetermined status. Variants can be set as either "accepted," "putative," or "rejected," We found it useful to define all recurrent HP FP variants as rejected and all polymorphic variants (nonpathogenic) as accepted. All remaining variants (nonpolymorphic and new FPs) identified was set as putative by the software. In this way, we could easily recognize and distinguish FPs and common nonpathogenic variants from potentially disease causing variants.
The AVA software includes a Command Line Interface (CLI) that was utilized to get coverage reports on each run, as the graphical user interface (GUI) does not provide this function. The script for retrieving coverage report is available by request.

Sanger sequencing
For each patient, 10 amplicons were Sanger sequenced after singleplex PCR. Of these, five amplicons contained HP regions ranging from 13 to 26 repeats and thereby caused base calling problems. Because of high similarity between PMS2 and PMS2CL, the primer pairs amplifying exons 13-15 had to be located in deep intron sequences to find sequences differences. Consequently, the corresponding amplicons were too long (575, 738, and 771 bp) to be sequenced by this MPS approach. The remaining two amplicons were Sanger sequenced because during previous runs they were consistently undercovered (<38 reads), even when including the entire singleplex volume in the multipliex. In addition, all nonpolymorphic variants, putative variants that could not be confidently determined as TP or FP and undercovered amplicons (<38-fold coverage) in the MPS workflow were also Sanger sequenced. Cycle sequencing reaction was performed with BigDye â Terminator v3.1 (Life Technologies) and subsequent capillary electrophoresis was performed by the 3130xl Genetic Analyzer (Life Technologies). Sanger data were analyzed using SeqScape Software v2.5 (Life Technologies).

cDNA sequencing to confirm PMS2 variants
To confirm that identified PMS2 variants truly originate from the PMS2 gene rather than any of the pseudogenes, we performed cDNA sequencing on seven samples containing PMS2 variants identified in exons or exon/ intron boundaries. RNA was isolated according to manufacturer's description from PAXgene Blood RNA tubes (Qiagen, Venlo, Limburg, Netherlands) using the nucleic acid purification kit PAXgene Blood RNA kit (Qiagen). We performed one-step RT-PCR to amplify the entire PMS2 transcript in two overlapping fragments. Exon 10 was utilized as an anchor for reverse primer for the first fragment spanning exons 1-10, and forward primer for the second fragment spanning exons 10-15. As exon 10 is not present in any of the pseudogenes, this design ensures a specific amplification of PMS2. For One-step RT-PCR, <200 ng RNA was amplified in 25 lL reactions using 0.2 lg of each primer, the enzyme mix SuperScript â III One-Step RT-PCR Platinum â Taq HiFi (Life Technologies), and 29 Reaction Mix containing 0.4 mmol/L of each dNTP. Cycling consisted of an initial cDNA synthesis step at 55°C in 30 min followed by denaturation at 94°C for 2 min. For the first fragment, denaturation step was followed by 38 cycles of 15 sec at 94°C, 30 sec at 59°C, and 2.5 min at 68°C. For the second fragment, denaturation was followed by 40 cycles of 15 sec at 94°C, 30 sec at 55°C, and 3.25 min at 68°C. Final extension was carried out at 68°C in 7 min for both fragments. The RT-PCR products were sequenced (as explained above) using primers amplifying the regions spanning the variants that were to be confirmed.

Results
Initially, 23 samples were sequenced only for the amplicons containing previously identified indel variants.
Subsequently, we sequenced the four MMR genes MSH2, MLH1, MSH6, and PMS2 in 32 patients divided on four GS Junior runs. Of these, 16 samples (run 1 and 2) were previously characterized with Sanger sequencing and were, together with the initial 23 samples containing indel variants, used to assess sensitivity and specificity of the GS Junior platform. The remaining 16 samples (run 3 and 4) were sequenced to test the utility of MPS workflow in a routine diagnostic setting. All variants identified in this study are presented in Table 1. From run 1 to 4, a strong optimization was performed to achieve uniform coverage distribution across all amplicons by adjusting amplicon pool ratios.

Variant calling, sensitivity and specificity
To assess insertion and deletion variant detection capabilities of the GS Junior platform, one run was specifically dedicated to this task. We sequenced 14 deletions, seven duplications, and two indel variants involving 1-14 bp and previously characterized with Sanger sequencing (footnoted 3 in Table 1). Two of these variants (c.680_683del in MSH2 and c.2156del in PMS2) were located in or close to HP regions. All variants were successfully called in 40-56% of total reads. More comparable to a diagnostic sequencing setup, we also sequenced 16 samples containing 37 unique variants (35 substitutions and two indel variants) previously characterized by Sanger sequencing (run 1 and 2). Only variants with combined VF of at least 15% and present in both forward and reverse reads were considered. When only considering the amplicons previously Sanger sequenced, 315 variants met filter criteria. Of these, 146 were TP and 169 were FP variants. Of the FP variants, 154 originated from 26 different HP regions ranging from 5 to 8 bp. Cross-sample comparison and evaluation of signal distributions in HP regions reduced the number of FP to 18 and did not discard any of the TP variants. Of these, three calls were in two different HP regions and the remaining 15 FP calls came from two unique calls in MSH2 (c.-243C>T) and PMS2 (c.-175A>T). Both of these variants were most likely caused by nonrandom sequencing errors introduced by the SensiMix TM polymerase  (Bioline), as they were not present in these fragments when amplified with AmpliTaq Gold 360 (Life Technologies). We are currently in the process of transferring singleplex amplification from SensiMixTM to AmpliTaq Gold 360. These two FP variants will therefore be eliminated in future runs. All 123 TP variants originating from 35 unique substitutions and two indel variants were detected. One of the indel variants, a deletion (c.1786_1788del in MSH2), was detected in~50% of both forward and reverse reads. The other, a duplication variant (c.3261dup in MSH6) located within a HP of eight repeats, was detected in 60% of the forward reads and 16% of the reverse reads, constituting 34% of total reads. Based on these results, the measured sensitivity (TP/ (TP + FN)) of this validation set was 100%. However, the maximum sensitivity of this sample size (n = 60 different variants) is 95% (95 CI) because of an 5% (3/n = 3/ 60 = 0.05) probability of a FN event not being represented in this validation set (Mattocks et al. 2010). To calculate specificity, we used the fraction of TP among all positives (TP/(TP + FP)), also known as positive predictive value (PPV), instead of the standard specificity (TN/ (TN + FP)). As all nucleotide positions coinciding with the reference sequence will be true negatives, the number of TN will be much larger than the number of FP. In this situation, the standard specificity will always be close to one, as the ratio will be dominated by TN, whereas the PPV will give a more informative value (Tompa et al. 2005;Zvelebil and Baum 2007). Hence, the specificity (PPV) after filtering was 46% (146/(146 + 169)). Crosssample comparison and evaluation of signal distributions increased the specificity to 89%. Theoretically, heterozygote and homozygote variants should be detected in 50% and 100% of the reads, respectively. In the four sequencing runs analyzed, the heterozygote detection range was from 19% to 63% of the reads (mean 46%). Homozygote variants were always found in >94% of the reads (mean 99%). Most of the TP variants with low detection frequency were located close to or in repetitive sequences. As an example, the variant with the lowest detection frequency (c.2006+6G>A in PMS2) was located between two short HP stretches (aaagtttt). This is a commonly occurring polymorphism and was always called at relatively low frequencies indicating that sequence context in close proximity to the variant can affect base calling efficiency. To evaluate the effect coverage has on VF, we plotted all the TP variants identified in run 1-4 against the coverage obtained for their respective amplicons (Fig. 3). The VFs were more variable at lower coverage, especially below 1009. As the coverage increased the heterozygote VFs came closer to 50%. Because of no sampling effect, less variation was observed for homozygous variants, also at lower coverage.

Variant annotation
An additional 16 samples (run 3 and 4) were sequenced to test the workflow. We utilized the possibility in AVA to filter variants according to a predefined status of the topical variant. All polymorphic (class 1) and FP variants identified in run 1 and 2 were defined as, accepted and rejected, respectively, and were used to filter variants identified in run 3 and 4. All other variants identified (new FPs and nonpolymorphic) were defined as putative by the software and required further investigation. In run 3, there were 34 different FP variants and 26 unique polymorphic variants recognized and filtered correctly as rejected and accepted, respectively, leaving 19 different variants defined as putative requiring further investigation. Of these, nine variants turned out to be FP calls in HP regions based on cross-sample comparison and evaluation of signal distributions, and one variant was a commonly occurring polymorphism. The remaining 10 variants were Sanger sequenced and confirmed nine nonpolymorphic (class 2-3) and one FP call in a HP region. The newly identified FP-and polymorphic variants were defined prior to run 4, in which only eight different variants were recognized as putative. Of these, seven variants remained putative after cross-sample comparison and evaluation of signal distributions. Sanger sequencing revealed that three variants were erroneous calls in HP regions and the remaining four were nonpolymorphic variants (class 2-3). Furthermore, 27 FP variants were recognized as rejected and 28 polymorphic variants as accepted. Indeed, the rejected variants were not totally disregarded; a visual inspection of variant frequencies and signal distributions (if necessary) was done to ensure that these calls were correctly assigned as FPs. All nonpolymorphic (class 2-5) variants identified in run 3 and 4 were confirmed by Sanger sequencing. In total, we identified 72 distinct and true variants (Table 1) by MPS, of which 23 were likely pathogenic or pathogenic variants (class 4 or 5). We could not find reference single-nucleotide polymorphism ID number for 19 of the variants. However, 13 of these variants have previously been reported by Sjursen et al. (2010). The variants c.3699_3702dup (MSH6), c.3848_3850dup (MSH6), c.4001+42_400+45dup (MSH6), c.823C>T (PMS2), c.1970del (PMS2), and c.2156del (PMS2) have, to our knowledge, not previously been published. For PMS2, 20 distinct variants were identified and none of these were found to be pseudogene sequence-specific variants. To confirm that these variants truly originated from PMS2, we performed cDNA sequencing of seven samples representing most of the 14 PMS2 variants located in exons or exonintron boundaries. However, three of the variants (c.1970del, c.2156del and c.1866G>A) could not be analyzed because RNA samples could not be obtained from the patients carrying the specific variants. The variants c.59G>A and c.823C>T could not be confirmed due to poor RNA quality (RIN 4.10) of the sample used. All the remaining nine variants (footnoted 4 in Table 1) were confirmed by cDNA sequencing to originate from PMS2 and not from any of the pseudogenes, indicating that our amplicon design is specific for amplification of PMS2. For the patients with disease-associated PMS2 variants, no other disease causing variants were identified in any other MMR gene.

Coverage results
For MPS to be implemented in diagnostic settings, the method needs to be cost-effective. Uniform distribution of coverage across amplicons permits a larger sample size to be analyzed in a single-GS Junior run and thereby reduce sequencing costs. In our approach, the singleplex amplicons were pooled in different ratios to obtain uniform coverage distribution. Optimization was performed by adjusting the pool ratios of the amplicons prior each run, based on the coverage results obtained in the previous run. A summary of the coverage results for each run is shown in Table 2. We used the next generation molecular diagnostics calculator developed by De Leeneer et al. (2011a) to calculate the "spread correction factor" and number of samples that can be included in each sequenc-  ing run. An excellent explanation for how to calculate how many samples that can be included in a sequencing run (based on spread correction factor) is also given in another study by this group (De Leeneer et al. 2011b). In our best optimized run (run 4), we achieved a spread correction factor of 2.21 allowing us to pool and sequence 10 samples in a single run. Figure 4 shows the distribution of coverage of all the amplicons sequenced in run 4. For this run, seven amplicons had a coverage <38. All these were caused by a single amplicon (MLH1_ex12B), which require further optimization.
Pooling mixed-length amplicons in the second multiplex PCR and the emPCR should in theory cause a length bias where the shorter fragments are more efficiently amplified than the longer amplicons. If this was the case, long fragments should be pooled at higher ratios than short fragments. In our library, the amplicon lengths differ with up to 206 bp. However, we did not observe a relation between amplicon length and the ratio of which the amplicon was pooled (data not shown), indicating that other factors like sequence context may be of more importance for how efficiently an amplicon is A B Figure 5. Correlation between amplicon length and the mean coverage for each amplicon in run 1 (A) and run 4 (B). amplified than the length (at least within the length range of the current study). Figure 5 shows a plot of the amplicon lengths against mean amplicon coverage in run 1 (before optimization) and run 4 (after optimization). The average coverage obtained for the different amplicons in run 4 is more focused around the mean value of 120 compared to run 1. This demonstrates that the approach of combining amplicons at different ratios based on their obtained coverage, efficiently counterbalance any effect length bias or sequence context has on amplification efficiency.

Time and cost evaluation
We compared the consumable costs and time used from singleplex PCR to the final result between the MPS workflow and our previous Sanger sequencing workflow for eight samples. We evaluated the hands-on time to be similar for the two approaches, while consumable costs were reduced to approximately one-third. Sample turnaround time was also substantially reduced from 14 days to 6 days, mainly attributed to shorter run-time on the GS Junior platform and more efficient data analysis.

Discussion
Increasing demand for genetic testing combined with the request of shorter turnaround times required a shift in sequencing methodology in our laboratory. Therefore, we evaluated if genetic testing using the GS Junior benchtop sequencer from Roche was suitable for implementation in a diagnostic laboratory in terms of sensitivity, specificity, distribution of coverage, hands-on time, costs, and sample turnaround time. We have developed and optimized a MPS workflow for analysis of the four MMR genes associated with LS that fulfilled all the above mentioned requirements. All true variants identified with Sanger sequencing were also detected by MPS.
An in-house amplicon-enrichment approach was found to be most compatible with diagnostic MPS of MMR genes compared to other enrichment strategies. Our laboratory has performed diagnostic analysis of MMR genes for many years, and only minor changes to our original setup was necessary to make it compatible with the new MPS workflow. Alternative to our approach is enrichment by long-range PCR, which has the advantage of amplifying target genes with only a few PCRs (Hernan et al. 2012). However, this approach amplifies large portions of intronic sequences that have waste sequencing capacity and thereby increase sequencing costs. Commercial solid-or liquid-phase DNAcapture methods enrich target sequences by hybridisation to oligonucleotides (Albert et al. 2007;Gnirke et al. 2009). Variable selection efficiency across the target regions and nonspecific capture of homologous sequences such as pseudogenes are major limitations for this enrichment procedure (reviewed in ten Bosch and Grody 2008;Gnirke et al. 2009). Use of DNA-capture methods in enrichment of MMR genes can therefore cause unreliable variant detection in PMS2 due to the presence of multiple pseudogenes. A recent study reported advances in data analysis to filter out detected pseudogene variants (Chou et al. 2010). Although this holds great promise, it still needs improvements and further validation before implementation in diagnostics. Another drawback with capturing is that the setup is fixed. Prior to sequence analysis of the MMR genes, immunohistochemistry (IHC) analysis is often performed to detect lack of expression of MMR-proteins in tumor tissue to guide which gene to target. As opposed to capturing, with amplicon-based methods, genes can be included in the sequencing runs according to IHC implications. Multiplex amplification of MMR genes using the commercial kit HNPCC MASTR TM (Multiplicom, Niel, Belgium) is another amplicon enrichment option. However, this kit does not offer specific amplification of PMS2. According to their own specification notes, exons 4, 5, 11-15 are nonspecific. This may cause unreliable variant detection in PMS2.
Compared to other studies, we chose quite loose filter settings (15% combined VF and present in both forward and reverse reads) to separate TP from FP variants. More stringent VF cutoff will reduce the number of FPs, but increase the risk of false negatives (FNs). As recommended by De Leeneer et al. (2011a), two recent studies, where the application was detection of hereditary disease causing variants, used a VF cutoff of 25% to detect a heterozygote variant (minimum coverage of 38% and 99.9% detection power) (De Leeneer et al. 2011b;Feliubadalo et al. 2012). Yet another study used a VF threshold of 35% (Jiang et al. 2012). In this study, VF thresholds of 25% and 35% would cause two FN and 15 FN, respectively. This clearly demonstrates that if thresholds are set too high to reduce amount of FP (increase specificity) it will cause FNs, which is unacceptable in a diagnostic setting where the sensitivity is more important than the specificity. Looser filter settings will cause poor specificity, but the only consequence is more FP variants to deal with, which may increase the workload of confirmatory Sanger sequencing. A major source of FP in our sequencing runs were incorrect base calling in HP regions. Different strategies for how to overcome this problem has been proposed like commercial HP assays that analyze coding HPs with capillary electrophoresis (Feliubadalo et al. 2012) or HRM analysis (De Leeneer et al. 2011b). We chose an alternative approach where we Sanger sequenced five amplicons with HPs >13 repeats that caused severe base calling problems. The remaining HPs were analyzed by MPS. Cross-sample comparison and evaluation of signal distributions of variants called in HP regions proved to be an efficient approach to identify TP variants and substantially reduced the number of FP variants requiring Sanger sequencing.
We developed this workflow to be used in a clinical diagnostic setting and our results suggest that the GS Junior system reliably detects substitutions and small insertion and deletion variants. However, heterozygote variant calling can be affected by sequence context, especially at lower coverage. As the coverage increases, the VFs come closer to the theoretically expected value of 50%. These aspects need to be taken into consideration when determining VF and minimum coverage thresholds. Although frameworks have been suggested (De Leeneer et al. 2011a), there are currently no common guidelines for analysis and interpretation of MPS data for reliable variant detection. Filter settings that effectively distinguish signal from noise probably need to be experimentally determined by each laboratory as they are likely to be dependent on enrichment technique, MPS technology, and software.
Reliable variant detection in PMS2 is challenging due to the presence of multiple pseudogenes. Strong homology between pseudogenes and the PMS2 sequence makes it difficult to design primers specific to PMS2. We designed primers utilizing the differences between PMS2 and the pseudogenes and our results indicate that the primers are specific to the PMS2 gene, as no pseudogenespecific sequence variants were detected. In addition, nine PMS2 variants were analyzed by cDNA sequencing and confirmed to truly originate from PMS2. However, sequence exchange between the 3′ region of PMS2 and the pseudogene PMS2CL, my lead to inclusion of PMS2CL sequences into PMS2 or vice versa (Hayward et al. 2007). In cases where pathogenic or pseudogenespecific sequence variants are detected, results will be confirmed by cDNA analysis or long-range PCR utilizing exon 10 (not present in pseudogene) as PMS2-specific primer location. Similar strategies have successfully been used in previous studies (Hayward et al. 2007;Etzler et al. 2008;Vaughn et al. 2010).
Detection of large deletions or duplications (whole exons or multiexons) should in theory be possible by MPS using relative ratios of reads. However, previous studies (De Leeneer et al. 2011b;Feliubadalo et al. 2012) have evaluated this method to be unreliable, due to the three amplification steps prior sequencing. Pending methodology improvements, we will continue to analyze large deletions and duplications in MMR genes with multiplex ligation-dependent probe amplification (MLPA).
Coverage uniformity across all amplicons is important to fully exploit the capacity of a GS Junior run. We performed coverage optimization by adjusting the pooling ratios of amplicons pooled from the singleplex PCRs to multiplex PCRs. This turned out to be an easy and effective way to achieve coverage uniformity. Our best performing run achieved a spread correction factor that outperformed a previously developed in-house method optimized by adjusting primer concentrations (De Leeneer et al. 2011b).
Using a molecule per bead ratio of 0.5, we achieved from 90725 to 143904 reads per run, which is substantially higher than guaranteed by Roche (70,000 reads/ run). However, we lost about 30% of reads per run, due to short reads that could not be mapped to the reference sequences. We have identified these short reads as primer dimers originating from amplification of the PMS2 gene. Although we further optimized the PMS2 PCRs, the short fragments could not be eliminated. Our current workflow is optimized for sequencing 10 samples in a single-GS Junior run. Removal of these short sequences can increase sample size even further and therefore alternative cleanup methods will be carried out to resolve this problem.
To conclude, we find MPS with the GS Junior system to be suited for routine clinical diagnostics offering reduced costs and accelerated sample turnaround, without compromising sensitivity. Still, we hope for future improvements of the method to overcome the challenges of incorrect base calling in HP regions. Our MPS workflow for variant detection in the four MMR genes is now well-implemented in the hospital laboratory, although we still rely on Sanger sequencing for some of the amplicons and to confirm nonpolymorphic variants.