Effective Variant Detection by Targeted Deep Sequencing of DNA Pools: An Example from Parkinson's Disease


Summary

Next-generation sequencing technologies will dominate the next phase of discoveries in human genetics, but considerable costs may still represent a limitation for studies involving large sample sets. Targeted capture of genomic regions may be combined with deep sequencing of DNA pools to efficiently screen sample cohorts for disease-relevant mutations. We designed a 200 kb HaloPlex kit for PCR-based capture of all coding exons in 71 genes relevant to Parkinson's disease and other neurodegenerative disorders. DNA from 387 patients with Parkinson's disease was combined into 39 pools, each representing 10 individuals, before library preparation with barcoding and Illumina sequencing. In this study, we focused the analysis on six genes implicated in Mendelian Parkinson's disease, emphasizing quality metrics and evaluation of the method, including validation of variants against individual genotyping and Sanger sequencing. Our data showed 97% sensitivity to detect a single nonreference allele in pools, rising to 100% where pools achieved sequence depth above 80x for the relevant position. Pooled sequencing detected 18 rare nonsynonymous variants, of which 17 were validated by independent methods, corresponding to a specificity of 94%. We argue that this design represents an effective and reliable approach with possible applications for both complex and Mendelian genetics.

Introduction

An increasingly available and affordable technology, next-generation sequencing (NGS) is now becoming the primary discovery tool in human genetics (Goldstein et al., 2013). This development also calls for rational, cost-effective and scalable study designs to generate sequence data for different scientific applications. Several early studies have successfully used an NGS approach to identify mutations underlying rare Mendelian disorders (Lupski et al., 2010; Ng et al., 2010), and many geneticists are now looking towards larger sample sets and common phenotypes.

Sequence data will be crucial to identify the functionally relevant variation behind association signals from genome-wide association studies (GWAS) and further explore the genetic architecture of common diseases. For genetically heterogeneous conditions, NGS might also be the most rational approach to screen for known mutations. Individual whole-genome or whole-exome sequencing is currently being used to study complex genetics, but will in many cases still be too resource-demanding for large sample sets. In this article, we outline a study design combining targeted capture with deep sequencing of DNA pools and demonstrate how this strategy may be applied to effectively sequence relevant genes in 387 patients with Parkinson's disease (PD).

PD is a common neurodegenerative disorder of complex aetiology. To date, 20 genetic loci influencing PD susceptibility have been identified through GWAS (Nalls et al., 2011; Lill et al., 2012; Pankratz et al., 2012). However, a minority of patients have monogenic forms of the disease, following dominant or recessive inheritance patterns. The proportion of PD caused by mutations in Mendelian genes varies considerably between populations (Nuytemans et al., 2010; Puschmann, 2013). Where previous efforts to determine mutation frequencies in large sample cohorts have been limited to a few mutations or genes, NGS now provides the opportunity for large-scale screening strategies.

We designed a targeted DNA capture panel of 997 exons in 71 genes relevant to PD. These included all proposed and established Mendelian genes, 24 genes adjacent to GWAS top hits, as well as a range of genes implicated in neurodegenerative disorders with overlapping clinical or pathological features. This panel was used for pooled sequencing of DNA from 387 PD patients in order to generate a data set that may serve as a resource for several different subprojects. The strategy for downstream filtering and validation of variants will need to be attuned to the specific hypothesis. In this study, our primary aim was to evaluate the performance of the targeted pooled NGS design. For this purpose, we decided to focus on the subset of genes where the coding variability in PD is currently best characterized, namely those implicated in Mendelian disease. We defined a target region corresponding to all coding regions of the six well-established autosomal dominant (SNCA; MIM# 163890, LRRK2; MIM# 609007 and VPS35; MIM# 601501) and recessive (PARK2; MIM# 602544, PINK1; MIM# 608309 and PARK7 or DJ-1; MIM# 602533) PD genes. In this article, we present the results from analysis of these genes with emphasis on quality measures and methodological issues.

Materials and Methods

Subjects

The study was approved by the Regional Committee for Medical Research Ethics (Oslo, Norway). All participants gave written, informed consent. Patients fulfilling United Kingdom Parkinson's Disease Brain Bank criteria for PD were recruited between 2007 and 2012 at Oslo University Hospital and Drammen Hospital, Norway. The mean age of onset was 53 years (range 27–78) and the proportion of males was 66%. All participants were examined by a neurologist and blood samples were collected by venipuncture.

Preparation of DNA Pools

DNA was extracted from whole blood by the same standard techniques for all samples. A selection of samples was tested on 1% agarose gel to assure the integrity of DNA. DNA was homogenized for 30 min in a thermoshaker at 50°C, and all samples were diluted to a working solution of approximately 50 ng/μl. Each sample was then carefully measured on a Qubit 2.0 Fluorometer (Invitrogen/Life Technologies, Paisley, UK) and further diluted with TE-buffer to 20 ng/μl. Subsequently, 10 μl, corresponding to 200 ng of DNA, were drawn from each of the samples, and mixed together with others in pools representing 10 individuals. We created 39 pools from a total of 387 individuals, three samples being present in two different pools.
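The dilution and pooling arithmetic described above can be checked with a short calculation. The Python sketch below is purely illustrative: the function and constant names are hypothetical, and the 50 μl final volume in the usage example is an assumed value that is not stated in the protocol.

```python
def dilution_volumes(stock_conc, target_conc, final_volume):
    """Return (stock volume, diluent volume) from C1 * V1 = C2 * V2."""
    if not 0 < target_conc <= stock_conc:
        raise ValueError("target concentration must be positive and not exceed stock")
    stock_volume = target_conc * final_volume / stock_conc
    return stock_volume, final_volume - stock_volume


# Values taken from the protocol described in the text.
TARGET_CONC_NG_PER_UL = 20.0   # concentration after the final dilution
DRAW_VOLUME_UL = 10.0          # volume drawn from each sample
SAMPLES_PER_POOL = 10

dna_per_sample_ng = TARGET_CONC_NG_PER_UL * DRAW_VOLUME_UL   # 200 ng per individual
pool_volume_ul = SAMPLES_PER_POOL * DRAW_VOLUME_UL           # 100 ul per pool
pool_dna_ng = SAMPLES_PER_POOL * dna_per_sample_ng           # 2000 ng per pool

# Assumed example: dilute a 50 ng/ul working solution to 50 ul at 20 ng/ul.
stock_ul, buffer_ul = dilution_volumes(50.0, TARGET_CONC_NG_PER_UL, 50.0)
print(f"{stock_ul:.1f} ul stock + {buffer_ul:.1f} ul TE buffer")
```

In practice the stock concentration would be the value measured for each sample, so the volumes differ from sample to sample while the amount of DNA contributed to the pool stays constant.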

Capture, Enrichment and Barcoding

The HaloPlex Target Enrichment System (Agilent Technologies, Santa Clara, CA, USA) relies on a tailored cocktail of restriction enzymes and customized probes to capture genomic regions of interest, which are subsequently amplified by polymerase chain reaction (PCR). We used the HaloPlex online design tool to create a 200 kb panel targeting 997 exons comprising 71 genes relevant to PD (Customized HaloPlex Kit 48 rxn, Art no.96005, Halo Genomics, currently Agilent Technologies, https://earray.chem.agilent.com/suredesign/). We used 39 unique barcodes from the HaloPlex Kit for this experiment. Target enrichment was performed according to the manufacturer's protocol (HaloPlex Target Enrichment System for Illumina Sequencing Version A, February 2012 and HaloPlex PCR Target Enrichment & Library Preparation Guide version 2.0, November 2011). In brief, 45 μl aliquots of each pool, corresponding to 900 ng of genomic DNA, were digested by restriction enzymes. Successful digestion was verified by gel electrophoresis, demonstrating expected bands at 125, 175 and 475 bp. DNA pools were then hybridized overnight to customized, biotinylated probes. Hybridized probes were captured with magnetic beads and target fragments were ligated to create circular DNA molecules. Subsequently, libraries were amplified by PCR, introducing unique index sequences that allow all pools to be sequenced together.

Sequencing

All libraries of target-enriched pooled DNA were analysed on a Bioanalyzer 2100 with high-sensitivity DNA chips (Agilent Technologies) to verify successful enrichment, demonstrating a smear of amplicons ranging from 225 to 525 bp. The samples were sequenced at the Norwegian Sequencing Centre, Oslo. All samples were run together on two lanes of an Illumina HiSeq 2000 to perform 100 bp paired-end sequencing.

Bioinformatic Analysis of Sequencing Data

A detailed description of the bioinformatic pipeline indicating the specific tools and parameters used is provided in Table S1. The HaloPlex workflow may create target fragments that are short enough for 100 bp sequencing to read through into the adapter towards the end of reads. Adapter sequence and low-quality tails of reads were therefore removed with Trimmomatic 0.30 (Lohse et al., 2012). Reads were aligned to the reference genome (GRCh37) with bwa 0.5.9 (Li & Durbin, 2009). Picard 1.77 was used to index and compress aligned sequence files to BAM format. Further sequence data processing, assessment of coverage and mismatch rates, variant calling and filtration were performed with the Genome Analysis Toolkit, GATK 2.5.2 (DePristo et al., 2011), restricting the target region to PARK7 (NM_007262.4), PINK1 (NM_032409.2), SNCA (NM_000345.3), PARK2 (NM_004562.2), LRRK2 (NM_198578.3) and VPS35 (NM_018206.4). We used ANNOVAR (version 2012may25) to annotate called variants (Wang et al., 2010).

The error rate of NGS usually requires some kind of quality filtering to avoid large numbers of false-positive variants. For this study, we applied quality filters with loose parameters in order to maintain a high sensitivity (Table S1). Restricting the analysis to mutations of potential relevance to monogenic PD, we further filtered out synonymous single-nucleotide polymorphisms (SNPs) and mutations that are classified as “not pathogenic” in the Parkinson Disease Mutation Database (Cruts et al., 2012). In a final filtering step, we removed variants with a frequency above 0.03 in the Exome Variant Server (ESP4500) or 1000 genomes (1000g2012feb) databases (Fig. 1).

Figure 1. Overview of variant filtering. The figure indicates how 60 initially called variants were processed through a series of filtering steps to a final list of 17 validated, potentially pathogenic variants.
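The filtering logic described in this section can be expressed compactly in code. The following Python sketch is an illustration under stated assumptions rather than the pipeline actually used (which is specified in Table S1): the record layout, the field names (`effect`, `pdmutdb_class`, `esp_freq`, `kg_freq`) and the toy variants are all hypothetical. Only the three filtering criteria and the 0.03 frequency threshold come from the text.

```python
MAX_POPULATION_FREQUENCY = 0.03  # threshold stated in the text


def passes_filters(variant):
    """Apply the three filtering steps described in the text to one record."""
    # Step 1: discard synonymous SNPs.
    if variant["effect"] == "synonymous":
        return False
    # Step 2: discard variants classified as "not pathogenic".
    if variant.get("pdmutdb_class") == "not pathogenic":
        return False
    # Step 3: discard variants above the frequency threshold in either database.
    # A missing frequency is treated as absence from that database.
    freqs = [variant.get("esp_freq") or 0.0, variant.get("kg_freq") or 0.0]
    return max(freqs) <= MAX_POPULATION_FREQUENCY


def filter_variants(variants):
    return [v for v in variants if passes_filters(v)]


# Toy records for illustration only; none of these are real variants.
toy_variants = [
    {"id": "toy_missense_rare", "effect": "nonsynonymous", "esp_freq": 0.0005, "kg_freq": None},
    {"id": "toy_synonymous", "effect": "synonymous", "esp_freq": 0.0001},
    {"id": "toy_benign", "effect": "nonsynonymous", "pdmutdb_class": "not pathogenic"},
    {"id": "toy_common", "effect": "nonsynonymous", "esp_freq": 0.12, "kg_freq": 0.10},
    {"id": "toy_novel", "effect": "nonsynonymous"},
]
kept_ids = [v["id"] for v in filter_variants(toy_variants)]
print(kept_ids)
```

Keeping each criterion as a separate early return makes it straightforward to count how many variants are removed at each step, which is the information summarized in Figure 1.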

Positive Controls and Validation

To serve as positive controls, 11 exonic variants in SNCA, PARK2, PINK1 and LRRK2 were genotyped prior to sequencing in 350 individual samples by MALDI-TOF mass spectrometry using the Sequenom MassARRAY system. For all variants passing filters, we validated mutations and identified individual carriers by Sanger sequencing of all samples represented in the positive pools. We used Primer3Plus to design primers for PCR amplification and sequencing of relevant exons. Primers and conditions are available on request. Amplicons were sequenced bidirectionally and Sequencher 5.1 (Gene Codes Corporation, Ann Arbor, MI, USA) software was used for data analysis.

Results

Coverage and Mismatch Rate

The six monogenic PD genes analysed in this study comprise a total of 99 exons, corresponding to a target region of 14 kb. The total sequence yield across this region was 2.5 Gb, which corresponds to an average depth of approximately 4600x per position in each pool. Benchmarks used when assessing coverage in pooled sequencing vary considerably between previously published studies. As described below, we did not observe missed alleles when depth was above 80x, corresponding to each of the 20 alleles in a pool being read four times on average. The proportion of bases covered above 80x in 80% of pools was 95% for the whole target region and above 94% for each individual gene. The performance of each pool is illustrated in Figure 2. Setting a stricter coverage benchmark of at least 200x, or 10 reads of each allele on average, decreases these measures only slightly, to 93% of the total target and above 92% for each gene. We note that one out of 39 pools achieved markedly less coverage than the rest, with only 79% of the target region covered at 80x (Fig. 2, Pool 18). The same pool also had the lowest library quality as assessed by Bioanalyzer prior to sequencing.

Figure 2. Coverage of target region across pools. The figure shows the proportion of targeted exonic positions covered above 80x depth for each of the 39 pools.
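The coverage figures above follow from simple arithmetic on pool size and depth. The Python sketch below reproduces that arithmetic and shows one way the "covered in 80% of pools" statistic could be computed from a per-pool depth matrix. It is a minimal illustration: the function names and the toy depth matrix are hypothetical, and the actual statistics in this study were produced with GATK as described in Table S1.

```python
INDIVIDUALS_PER_POOL = 10
ALLELES_PER_POOL = 2 * INDIVIDUALS_PER_POOL  # diploid autosomal loci


def mean_reads_per_allele(depth):
    """Average number of reads per allele if all alleles are equally represented."""
    return depth / ALLELES_PER_POOL


def fraction_well_covered(depths, min_depth=80, min_pool_fraction=0.8):
    """depths[pool][position] holds read depth.

    Returns the fraction of positions that reach min_depth in at least
    min_pool_fraction of the pools.
    """
    n_pools = len(depths)
    n_positions = len(depths[0])
    good = 0
    for pos in range(n_positions):
        pools_ok = sum(1 for pool in depths if pool[pos] >= min_depth)
        if pools_ok / n_pools >= min_pool_fraction:
            good += 1
    return good / n_positions


# Target size (bp) x mean depth x number of pools, using figures from the text.
total_bases = 14_000 * 4600 * 39

# Toy matrix: 5 pools x 4 positions (illustrative values only).
toy_depths = [
    [100, 90, 85, 0],
    [100, 95, 90, 0],
    [100, 80, 81, 0],
    [100, 85, 10, 0],
    [100, 10, 20, 0],
]
print(mean_reads_per_allele(80), mean_reads_per_allele(200))
print(total_bases, fraction_well_covered(toy_depths))
```

The product of target size, mean depth and pool number comes to roughly 2.5 Gb, consistent with the total reported above.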

Next, we assessed the occurrence of complete coverage gaps in the data. The best performing pools had reads mapping to 98.8% of the target region, corresponding exactly to the theoretically predicted coverage of the HaloPlex kit (data not shown). Coverage gaps occurred mainly in LRRK2 exons 9, 20 and 32, and in VPS35 exon 14 (Table 1).

Table 1. Depth and coverage statistics

| Gene | PARK7 | PINK1 | SNCA | PARK2 | LRRK2 | VPS35 | Total |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Percentage of exonic positions theoretically targeted by HaloPlex capture design | 100 | 99.8 | 100 | 99.5 | 98.5 | 98.2 | 98.8 |
| Percentage of total exonic positions sequenced to minimum 80x in 80% of samples | 100 | 97.7 | 100 | 95.1 | 95.6 | 94.8 | 95.2 |
| Average depth per exonic position per sample pool | 5627 | 4176 | 5889 | 4581 | 4928 | 3570 | 4627 |
| Number of exons | 6 | 8 | 5 | 12 | 51 | 17 | 99 |
| Exons flagged for coverage gaps¹ | | | | | Exons 9, 20 and 32 | Exon 14 | |

¹The GATK tool DiagnoseTargets was used to flag exons if more than 20% of pools lacked coverage for 10% of positions.

Pooled sequencing relies on the ability to distinguish a true allele occurring on a small proportion of reads from the background error rate of the experiment. In Illumina 100 bp sequencing, the quality tends to fall towards the end of reads. This might make a subset of targeted positions vulnerable to higher error rates, given that the PCR-based HaloPlex protocol produces many identical reads, with only few overlaps per position. We therefore applied strict trimming of low-quality ends of reads in the raw data (Table S1). Furthermore, we assessed the mismatch rate, meaning the proportion of bases differing from reference, calculated across each instrument cycle. There was no systematic relationship between mismatch rate and read position (data not shown). The overall mismatch rate for the experiment was 0.13%.
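The per-cycle mismatch rate described above can be sketched as follows. This Python example is a simplified illustration, not the GATK procedure used in the study: it assumes a hypothetical input of ungapped read and reference strings of equal length, given in read orientation, and it ignores indels, base qualities and soft clipping.

```python
def mismatch_rate_by_cycle(aligned_pairs):
    """Compute mismatch rates per sequencing cycle and overall.

    aligned_pairs: iterable of (read_bases, reference_bases) string pairs,
    each pair of equal length and ungapped (a simplifying assumption).
    Returns (list of per-cycle rates, overall rate).
    """
    mismatches = []
    totals = []
    for read, ref in aligned_pairs:
        if len(read) != len(ref):
            raise ValueError("read and reference segments must have equal length")
        for cycle, (read_base, ref_base) in enumerate(zip(read, ref)):
            if cycle == len(totals):
                # First time this cycle is seen: extend the counters.
                totals.append(0)
                mismatches.append(0)
            if read_base == "N" or ref_base == "N":
                continue  # uncalled bases are not counted
            totals[cycle] += 1
            if read_base != ref_base:
                mismatches[cycle] += 1
    per_cycle = [m / t if t else 0.0 for m, t in zip(mismatches, totals)]
    n_bases = sum(totals)
    overall = sum(mismatches) / n_bases if n_bases else 0.0
    return per_cycle, overall


# Toy data: three reads, the third trimmed to two bases.
toy_pairs = [("ACGT", "ACGT"), ("ACGA", "ACGT"), ("AC", "AC")]
per_cycle, overall = mismatch_rate_by_cycle(toy_pairs)
print(per_cycle, overall)
```

A rising trend in the per-cycle list towards later cycles would indicate the end-of-read quality decay that the trimming step is intended to remove.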

Sensitivity

To assess the sensitivity of the pooled sequencing experiment, we compared results from genotyping of 350 patients with called variants from sequencing of the corresponding 35 pools. A detailed account of this comparison is given in Table S2. We genotyped 11 exonic SNPs, but for three of these, all patients were homozygous for the reference allele. For the remaining eight variants, we observed heterozygous carriers, with the total number of nonreference alleles in the data set ranging from 1 to 205 for each SNP.

The most fundamental aspect of sensitivity in this design will be the experiment's ability to detect a variant where only one allele is present in a pool. We tested 59 such single-allele pools representing seven different SNPs and a variant was called in 57 cases, corresponding to a sensitivity of 97%. The two cases where sequencing failed concerned rs55774500 in exon 3 of PARK2, and depth was below 80x at this position for both pools where the variant was missed. Consequently, if we restrict the analysis to pools achieving a depth above 80x at the relevant position, the observed sensitivity was 100%. To validate findings, we performed a total of 210 individual bidirectional Sanger sequencing analyses across 12 different exons. We observed no additional variants missed by pooled NGS in this Sanger sequencing data, amounting to a total of more than 37 kb.
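The relationship between depth and the chance of sampling a single allele can be illustrated with a binomial model. The Python sketch below is an idealized calculation under stated assumptions: every allele in the pool is taken to contribute equally to the reads (fraction 1/20), and the minimum number of supporting reads (`MIN_READS`) is a hypothetical threshold chosen for illustration, not a parameter of the variant caller used in this study.

```python
from math import comb


def prob_at_least(n_reads, p, k):
    """P(X >= k) for X ~ Binomial(n_reads, p)."""
    below = sum(comb(n_reads, i) * p**i * (1 - p) ** (n_reads - i) for i in range(k))
    return 1.0 - below


# Observed sensitivity reported in the text: 57 of 59 single-allele pools called.
observed_sensitivity = 57 / 59

SINGLE_ALLELE_FRACTION = 1 / 20   # one allele among 10 diploid individuals
MIN_READS = 2                     # assumed threshold, for illustration only

for depth in (40, 80, 200):
    p_seen = prob_at_least(depth, SINGLE_ALLELE_FRACTION, MIN_READS)
    print(f"depth {depth}x: P(at least {MIN_READS} variant reads) = {p_seen:.4f}")
print(f"observed sensitivity = {observed_sensitivity:.1%}")
```

Under this model the probability of sampling the allele rises steeply with depth, which is in line with the observation that both missed calls occurred at positions with depth below 80x. Real data will deviate from the idealized model because of unequal DNA input and amplification bias.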

Where a pool contains multiple nonreference alleles for a given SNP, the variant calling algorithm will calculate the most probable allele count. For studies concerned with rare variant detection, the accuracy of this estimate is less important, especially if positive pools are reanalysed by Sanger sequencing. However, in study designs where pooled sequencing is used to estimate allele frequencies of common variants, this quantitative aspect is crucial. We tested 56 instances where multiple nonreference alleles of an SNP, from 2 up to 10, were present in a pool. We observed that the number of nonreference alleles was called correctly in a majority of cases, but the called count could deviate by up to two alleles in either direction from the genotyping result (Table S2). Precision tended to fall as the proportion of nonreference alleles in a tested pool increased. No variants were called where pools were negative for an SNP as assessed by genotyping.
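The idea of a most probable allele count can be made concrete with a simple maximum-likelihood calculation. The Python sketch below is not the algorithm implemented in GATK; it is a minimal binomial model that assumes equal representation of all 20 alleles and a symmetric per-base error rate, here set to the overall mismatch rate of 0.13% reported for this experiment. The function name and parameters are hypothetical.

```python
from math import log


def most_likely_allele_count(alt_reads, depth, n_alleles=20, error_rate=0.0013):
    """Return the nonreference allele count (0..n_alleles) that maximizes a
    binomial likelihood of observing alt_reads variant reads out of depth."""
    if not 0 <= alt_reads <= depth:
        raise ValueError("alt_reads must lie between 0 and depth")
    best_count, best_loglik = 0, float("-inf")
    for count in range(n_alleles + 1):
        true_fraction = count / n_alleles
        # Expected variant read fraction, allowing for miscalls in both directions.
        p_alt = true_fraction * (1 - error_rate) + (1 - true_fraction) * error_rate
        p_alt = min(max(p_alt, 1e-12), 1 - 1e-12)  # guard against log(0)
        # The binomial coefficient is the same for every count and is omitted.
        loglik = alt_reads * log(p_alt) + (depth - alt_reads) * log(1 - p_alt)
        if loglik > best_loglik:
            best_count, best_loglik = count, loglik
    return best_count


print(most_likely_allele_count(4, 80))    # about 5% variant reads
print(most_likely_allele_count(40, 80))   # about 50% variant reads
```

Because neighbouring allele counts differ by only 5% in expected read fraction, sampling noise and amplification bias can shift the estimate by an allele or two, and the binomial variance is largest near a 50% fraction. Both effects are in keeping with the pattern described above.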

Filtering and Detection of Variants

Initial variant calling identified 60 variants in the data set passing the basic quality threshold. Subsequently, we performed a series of filtering steps retaining 18 nonsynonymous, low-frequency variants of potential pathogenic significance (Fig. 1). Out of these 18 variants, two (PARK2 p.Ala82Glu and LRRK2 p.Gly2019Ser) were already genotyped as positive controls, and one (LRRK2 p.Asn1437His) was found to have been previously detected in a diagnostic setting. Sanger sequencing confirmed 14 out of the 15 remaining variants and individual carriers were identified in each of the implicated pools. One single variant was not validated by Sanger sequencing and discarded as a false positive, corresponding to an observed specificity of 94% in our experiment. The final set of 17 variants is listed in Table 2. Based on presence in dbSNP137, Exome Variant Server and 1000 genomes databases, 11 variants were previously reported and 6 were novel. The novel variants were submitted to the ClinVar database (Landrum et al., 2014). We note that although we chose to filter out synonymous SNPs in this study, this class of variants may also contribute to disease through mechanisms such as splicing and RNA processing (Sauna & Kimchi-Sarfaty, 2011). Such effects are difficult to predict for variants detected in a screening context, but would have to be considered in a complete analysis of possibly pathogenic mutations.

Table 2. Results from targeted pooled sequencing of Mendelian PD genes

| Genomic coordinate | Gene¹ | Base change | Amino acid change | dbSNP137 | ClinVar submission | ESP database frequency² | Allele count³ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1:8037783 | PARK7 | c.394A>G | p.Lys132Glu | rs200894731 | | | 1 |
| 1:8037788 | PARK7 | c.399G>C | p.Met133Ile | | SCV000114936 | | 1 |
| 1:20964591 | PINK1 | c.644C>T | p.Pro215Leu | | SCV000114938 | | 1 |
| 1:20971129 | PINK1 | c.923T>A | p.Leu308Gln | | SCV000114939 | | 1 |
| 1:20971158 | PINK1 | c.952A>T | p.Met318Leu | rs139226733 | | 0.00093 | 3 |
| 1:20975105 | PINK1 | c.1231G>A | p.Gly411Ser | rs45478900 | | 0.001022 | 1 |
| 6:161771219 | PARK2 | c.1310C>T | p.Pro437Leu | rs149953814 | | 0.002138 | 4 |
| 6:162394367 | PARK2 | c.701G>A | p.Arg234Gln | rs144032774 | | 0.000186 | 1 |
| 6:162683724 | PARK2 | c.245C>A | p.Ala82Glu | rs55774500 | | 0.001581 | 6 |
| 12:40631852 | LRRK2 | c.518A>G | p.Asn173Ser | rs201415714 | | | 1 |
| 12:40677787 | LRRK2 | c.2352C>A | p.Ser784Arg | | SCV000114940 | | 1 |
| 12:40688681 | LRRK2 | c.2843G>A | p.Arg948Gln | rs200442352 | | | 1 |
| 12:40692281 | LRRK2 | c.3333G>T | p.Gln1111His | rs78365431 | | 0.000093 | 1 |
| 12:40703027 | LRRK2 | c.4309A>C | p.Asn1437His | rs74163686 | | | 1 |
| 12:40734202 | LRRK2 | c.6055G>A | p.Gly2019Ser | rs34637584 | | 0.000465 | 3 |
| 12:40758762 | LRRK2 | c.7300A>G | p.Ile2434Val | | SCV000114941 | | 1 |
| 16:46702913 | VPS35 | c.1576C>T | p.Arg526Cys | | SCV000114937 | | 1 |

The table lists our final set of validated variants of potential pathogenic significance detected in 387 patients by pooled resequencing of Mendelian PD genes.

¹Genomic coordinates refer to build GRCh37. Relevant RefSeq accession numbers for mRNA and protein are: PARK7: NM_007262.4, NP_009193.2; PINK1: NM_032409.2, NP_115785.1; PARK2: NM_004562.2, NP_004553.2; LRRK2: NM_198578.3, NP_940980.3; VPS35: NM_018206.4, NP_060676.2.

²ESP database frequencies are given as proportions of 1.

³All mutation carriers were heterozygotes.

Discussion

In this study, we performed pooled deep sequencing of 387 PD samples, using HaloPlex PCR-based target enrichment to capture relevant genomic regions. We identified 11 known and 6 novel variants in six Mendelian PD genes and evaluated the method by coverage assessment and validation against SNP genotyping and Sanger sequencing. We suggest that the combination of targeted capture and DNA pooling may be a cost-effective and scalable approach for variant detection in large sample sets in a research setting. In Table 3, we summarize key aspects of a pooled NGS design as compared to a traditional Sanger sequencing approach.

Table 3. Comparison of study designs for rare variant detection in disease cohorts

| Individual Sanger sequencing | Pooled targeted next-generation sequencing |
| --- | --- |
| Project may develop over time | Project relies on the initial design |
| Requires a series of experiments to analyse individual exons in individual samples | Exons and samples in the range of hundreds, or even thousands, are analysed in parallel in a single experiment |
| Accumulating costs and workload with increasing data generation | One initial investment effectively generates large amounts of data |
| Large-scale projects very labour-intensive and time-consuming | Easily scaled to projects involving large sample sets and extensive target regions |
| PCR primers successively tailored to amplify full target region | Target enrichment may be incomplete in challenging genomic regions |
| Highly accurate results, method also applicable in a diagnostic setting | Sensitivity and specificity adequate in the context of clinical research |
| Data analysis straightforward with conventional tools | Requires more advanced bioinformatic processing of data |
| Validation of findings generally not required | May require validation of findings by independent methods |

A targeted pooled sequencing strategy was recently used successfully to identify rare causal variants in breast and ovarian cancer (Ruark et al., 2013), as well as short stature (Wang et al., 2013). These studies investigated large numbers of candidate genes, more than 1000 in the latter case. In contrast, a study of Alzheimer's disease applied pooled sequencing to screen for mutations in only five genes (Cruchaga et al., 2012). We designed a panel sufficiently large to encompass all genes implicated in previous genetic studies of PD and related disorders, but not extensive “new” candidate gene sets.

The main motivation for creating DNA pools is to reduce the costs related to target enrichment and sequencing, which still represent a substantial barrier for larger projects. While some laboratory effort is required for the preparation of equimolar pools, the manual workload is considerably reduced downstream, as compared to individual sequencing. A major advantage of the HaloPlex-targeted enrichment system is the convenient laboratory workflow, integrating both capture and library preparation. The protocol allows one person to prepare a set of finished libraries within two working days, and requires no larger specialized laboratory instruments.

Barcoding offers the opportunity to mix DNA libraries on the same flow cell lane, while allowing raw data to be extracted for each individual library. We note, however, that our pooled design maximizes efficiency on two different levels by sequencing barcoded pools, which reduces the experiment costs considerably more than barcoding alone. Savings are most significant for the target enrichment, which is currently the most expensive step in this kind of study, but efficiency is also further optimized for the sequencing, utilizing the whole capacity of the instrument run. For our 200 kb target, ample coverage for all samples was obtained on only two lanes of the Illumina HiSeq.

We achieved coverage above 80x for 95% of exonic positions in 80% of samples, compared to 98.8% theoretically targeted by our capture kit. This corresponds to high performance regarding both design efficiency and target enrichment efficiency, as compared to previous successful pooled sequencing studies (Rivas et al., 2011; Ruark et al., 2013). By comparison, frequently used capture technologies based on random DNA shearing and hybridization will typically miss considerably more of the region of interest due to limitations in design, even though deep sequencing yields good coverage of the actual baits in the kit (Hedges et al., 2011; Kiialainen et al., 2011).

Bearing this in mind, we also recognize that the target capture in our study has limitations. HaloPlex technology relies on restriction sites and customized hybridizing probes, which may be difficult to tailor for some genomic regions, thus limiting design efficiency. Our data show that while some reads were present for the total theoretical target, a minor proportion of the region failed to achieve sufficient coverage. Considering that sequence depth per position in each pool was 4600x on average, and that varying the coverage benchmark made little impact on the statistics, it is unlikely that increasing the total sequencing depth would rescue these poorly covered fragments.

The phenomenon of variable relative coverage is well described in the NGS literature (Ross et al., 2013). Loci with extreme base compositions (i.e., highly GC- or AT-rich regions) are known to be the major cause of low relative coverage, where library amplification by PCR has been shown to be the most critical step (Aird et al., 2011). In our data, we observed that a small subset of pools, and one in particular, showed markedly lower coverage performance than the majority of pools. This suggests that factors such as subtle variation in the quality of input DNA or the inevitable methodological variability introduced by manual processing of samples may impact the enrichment efficiency in these most vulnerable target regions.

Validation against positive control genotypes demonstrated how stretches of inadequate coverage might lead to false negatives, as our pooled NGS experiment missed two alleles where sequence depth was low. However, if the analysis is restricted to adequately covered loci, our observed sensitivity was 100%. This indicates that the principle of pooling DNA from 10 individuals prior to target enrichment is compatible with excellent sensitivity, whereas vulnerability to false negatives primarily depends on the enrichment efficiency.

While this study is concerned with effective mutation screening in large sample sets for research purposes, the required level of sensitivity would be higher in the context of clinical diagnostics. A recent study reported the performance of individual targeted NGS for clinical diagnosis of hereditary cardiomyopathy (Sikkema-Raddatz et al., 2013). Achieving 99% coverage in their experiment, the authors recommend that diagnostic NGS should be supplemented by Sanger sequencing where relevant genes are inadequately covered. A similar approach might also be considered for mutation screening in a research context, if sensitivity is to be increased further.

Regarding specificity, the pooled design relies on accurate distinction of true nonreference alleles occurring in a low proportion of reads from the background of sequencing errors. The empirical mismatch rate includes all deviations from reference, both biological variation and miscalls of technical origin, although sequencing errors will dominate quantitatively in highly conserved regions where true variants are rare. We observed a total mismatch rate of 0.13% in our experiment. This figure corresponds well to reported estimates of the error rate in Illumina sequencing (Glenn, 2011), and is far below the expected 5% read frequency of a single allele in our pools of 10 individuals. It has been shown, however, that the distribution of errors is not random, but tends to follow specific patterns (Nakamura et al., 2011). While such sequence-specific susceptibility to errors may give rise to false-positive variant calls, the data quality metrics will typically be low in such loci, allowing for bioinformatic filtering to increase specificity.
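The margin between signal and background noise can be quantified with a short calculation. The Python sketch below compares the expected number of reads from a single true allele with the expected number of erroneous reads at a depth of 80x, and estimates how often random errors alone would reach the expected signal level. It assumes independent, uniformly distributed errors, which the preceding paragraph notes is not strictly true, and it treats the full mismatch rate as if every error produced the same alternative base, which overstates the noise. All names are hypothetical.

```python
from math import comb


def binomial_tail(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return 1.0 - sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))


DEPTH = 80                 # coverage benchmark used in this study
ERROR_RATE = 0.0013        # overall mismatch rate reported in the text
ALLELE_FRACTION = 1 / 20   # single allele in a pool of 10 individuals

expected_variant_reads = DEPTH * ALLELE_FRACTION   # reads from one true allele
expected_error_reads = DEPTH * ERROR_RATE          # reads from background error
signal_to_noise = ALLELE_FRACTION / ERROR_RATE

# Probability that errors alone yield at least as many mismatching reads
# as a single true allele is expected to contribute.
p_errors_mimic = binomial_tail(DEPTH, ERROR_RATE, round(expected_variant_reads))

print(f"expected variant reads: {expected_variant_reads:.1f}")
print(f"expected error reads:   {expected_error_reads:.3f}")
print(f"signal-to-noise ratio:  {signal_to_noise:.1f}")
print(f"P(errors reach signal): {p_errors_mimic:.2e}")
```

Under these idealized assumptions the expected signal exceeds the expected noise by more than an order of magnitude, so false positives are more likely to arise from systematic, sequence-specific errors than from random miscalls.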

Much like for NGS in general, there is currently no standardized pipeline or established consensus on the optimal bioinformatic strategies for the analysis of pooled sequencing data. It may be worth noting that restriction enzyme digestion and PCR produce sequence data with many identical reads, in contrast to reads from randomly fragmented DNA. This will affect some of the quality-control metrics that are often exploited in variant filtering. The parameters and strategies for quality filtering will have to be tuned to the relevant hypothesis and design. For this study, we wanted to prioritize a high sensitivity and we therefore chose loose parameters that filtered out only a few variants. Still, all but one variant were confirmed by Sanger sequencing, corresponding to a specificity of 94%. For other pooled sequencing applications, one might be more concerned with high specificity and apply stricter filtering. A weakness in our study design is the need to follow up positive findings by Sanger sequencing of all 10 individuals in a pool. We note, however, that this step may not be necessary for all pooled NGS experiments. As an example, a recent study of rheumatoid arthritis used pooled sequencing to identify rare risk variants in a design that relied on high-quality calls without validation or identification of individual carriers (Diogo et al., 2013).

The Genome Analysis Toolkit provides a comprehensive framework for sequence data processing and variant detection that will be familiar to many researchers involved in NGS analysis. Since the GATK version 2.0, an extension to the UnifiedGenotyper tool has been implemented to support variant calling from pooled samples. Noting that it represents only one among a range of available software tools (Bansal, 2010; Rivas et al., 2011; Vallania et al., 2012), we would argue that our results demonstrate the capacity of UnifiedGenotyper to effectively and reliably call SNPs from pooled sequencing. A clear weakness concerning this study is that our set of positive control genotypes did not include any small indels. Indel calling in NGS is generally more challenging than SNP calling, and could be expected to be even more difficult in the context of pooled sequencing. The detection of larger copy number variations (CNVs) would require different methods. We note that for PARK2 and SNCA in particular, a substantial proportion of pathogenic mutations are CNVs (Nuytemans et al., 2010).

With the exception of possible undetected CNVs, our experiment gives an overview of potentially pathogenic Mendelian PD gene mutations in a sample set of 387 patients. While we identified 17 variants from 38 different carriers, only a small proportion of these will have monogenic PD. For most of the listed variants, the pathogenic potential is currently uncertain. Twenty-eight patients had heterozygous mutations in recessive genes PARK7, PINK1, or PARK2. It remains unclear to what extent the carrier state of such mutations represents a risk factor for PD (Nuytemans et al., 2010).

Regarding the dominant genes, four patients had highly penetrant LRRK2 mutations (three p.Gly2019Ser, one p.Asn1437His) (Ross et al., 2011). The remaining LRRK2 variants have an unknown role in relation to PD. Coding variants in SNCA are very rare, but recently a few novel pathogenic mutations have been reported (Appel-Cresswell et al., 2013; Lesage et al., 2013; Proukakis et al., 2013). We found no SNCA variants in our sample set.

From our final set of 17 validated variants, 6 were novel mutations not previously reported. This demonstrates the capacity of the pooled sequencing approach to identify new variants. One novel variant was found in VPS35. This gene was implicated in dominant PD more recently (Vilarino-Guell et al., 2011; Zimprich et al., 2011). The p.Arg526Cys mutation is classified as damaging by prediction tools SIFT and MutationTaster, but not by PolyPhen2, and was detected in a patient without known family history of PD. It remains to be seen whether this variant is found in other individuals with or without PD.

In this study, we analysed sequence data from 6 genes included in a targeted enrichment panel of 71 genes in total. Having gained experience with the method and demonstrated its ability to reliably detect rare coding variants, we will make use of the remaining sequence data in future projects to explore various hypotheses related to the genetic architecture of Parkinsonism.

In summary, we have demonstrated how deep sequencing of genomic target regions in DNA pools may be a cost-effective, scalable and reliable design for NGS studies in a research context. The example of PD, a complex disorder where a minority of patients have Mendelian forms of the disease, serves to show how targeted pooled sequencing can be useful to pursue a range of scientific hypotheses, concerning both complex and monogenic conditions.

Acknowledgements

L. Pihlstrøm is supported by a grant from South-Eastern Norway Regional Health Authority, Norway. M. Toft and A. Rengmark are supported by grants from the Research Council of Norway. Patient recruitment and sample collection have also been funded by the Norwegian Parkinson Research Fund and Reberg's Legacy. The authors declare no conflicts of interest.

The authors thank research nurse Lena Pedersen, Department of Neurology, Oslo University Hospital for assistance in sample collection and handling, Zafar Iqbal, Department of Neurology, Oslo University Hospital for comments on the manuscript, and Tim Hughes, the Norwegian Sequencing Centre, for bioinformatic advice. The sequencing service was provided by the Norwegian Sequencing Centre (http://www.sequencing.uio.no), a national technology platform hosted by the University of Oslo and supported by the “Functional Genomics” and “Infrastructure” programs of the Research Council of Norway and the South-Eastern Norway Regional Health Authority.

Web Resources

The URLs for databases and software referenced in this article are as follows:

Online Mendelian Inheritance in Man (OMIM), http://omim.org

1000 Genomes, http://www.1000genomes.org

NHLBI Exome Sequencing Project (ESP), http://evs.gs.washington.edu/EVS/

ClinVar, http://www.ncbi.nlm.nih.gov/clinvar/

dbSNP, http://www.ncbi.nlm.nih.gov/projects/SNP/

Burrows-Wheeler Aligner, http://bio-bwa.sourceforge.net

Picard, http://picard.sourceforge.net

The Genome Analysis Toolkit, http://www.broadinstitute.org/gatk/

Parkinson Disease Mutation Database, http://www.molgen.ua.ac.be/PDmutDB/

ANNOVAR, http://www.openbioinformatics.org/annovar/

Trimmomatic, http://www.usadellab.org/cms/?page=trimmomatic

SIFT, http://sift.jcvi.org

PolyPhen2, http://genetics.bwh.harvard.edu/pph2/

MutationTaster, http://www.mutationtaster.org

Primer3Plus, http://primer3plus.com
