Diagnosis of autosomal dominant polycystic kidney disease using efficient PKD1 and PKD2 targeted next-generation sequencing

Molecular diagnostics of autosomal dominant polycystic kidney disease (ADPKD) relies on mutation screening of PKD1 and PKD2, which is complicated by extensive allelic heterogeneity and the presence of six highly homologous sequences of PKD1. To date, specific sequencing of PKD1 requires laborious long-range amplifications. The high cost and long turnaround time of PKD1 and PKD2 mutation analysis using conventional techniques limits its widespread application in clinical settings. We performed targeted next-generation sequencing (NGS) of PKD1 and PKD2. Pooled barcoded DNA patient libraries were enriched by in-solution hybridization with PKD1 and PKD2 capture probes. Bioinformatics analysis was performed using an in-house developed pipeline. We validated the assay in a cohort of 36 patients with previously known PKD1 and PKD2 mutations and five control individuals. Then, we used the same assay and bioinformatics analysis in a discovery cohort of 12 uncharacterized patients. We detected 35 out of 36 known definitely, highly likely, and likely pathogenic mutations in the validation cohort, including two large deletions. In the discovery cohort, we detected 11 different pathogenic mutations in 10 out of 12 patients. This study demonstrates that laborious long-range PCRs of the repeated PKD1 region can be avoided by in-solution enrichment of PKD1 and PKD2 and NGS. This strategy significantly reduces the cost and time for simultaneous PKD1 and PKD2 sequence analysis, facilitating routine genetic diagnostics of ADPKD.

is a PhD student supported by the Spanish Ministry of Economy and Competiveness. R. T. is supported by Intensification Programm of Research Activity ISCIII/ Generalitat de Catalunya (programm I3SN). We acknowledge support of the Spanish Ministry of Economy and Competitiveness, 'Centro de Excelencia Severo Ochoa 2013-2017', SEV-2012-0208.

Introduction
Autosomal dominant polycystic kidney disease (ADPKD; OMIM IDs: 173900; 613095) is the most common inherited kidney disease, with an incidence of 1 in 400-1000 (Iglesias et al. 1983;Dalgaard and Norby 1989). ADPKD is caused by mutations in PKD1 (16p13.3; OMIM ID: 601313) in approximately 85% of the cases (The European Polycystic Kidney Disease Consortium 1994), and in PKD2 (4q21; OMIM ID: 173910) in the remaining 15% (Mochizuki et al. 1996). ADPKD is characterized by the development and progressive enlargement of cysts in the kidneys and other organs, eventually leading to end-stage renal disease (ESRD). The ADPKD phenotype displays a significant variability that is greatly influenced by the affected gene. Thus, PKD1 patients have a median age at ESRD of 58 years compared to 79 years for PKD2 mutated patients (Cornec-Le Gall et al. 2013).
Diagnosis of ADPKD is mainly performed by renal imaging such as ultrasonography, computed tomography, or magnetic nuclear resonance (Pei et al. 2009). However, molecular diagnostics is necessary in several situations: (1) when a definite diagnosis is required in young individuals, such as a potential living related donor in an affected family with equivocal imaging data; (2) in patients with a negative family history of ADPKD, because of potential phenotypic overlap with several other kidney cystic diseases; (3) in families affected by early-onset polycystic kidney disease, since in this cases hypomorphic alleles and/or oligogenic inheritance can be involved (Rossetti et al. 2009;Bergmann et al. 2011;Harris and Hopp 2013); and (4) in patients requesting genetic counseling, especially in couples wishing a preimplantation genetic diagnosis (Harris and Rossetti 2010).
Approximately 70% of the 5 0 genomic region of the PKD1 gene (exons 1-33) is duplicated six times on chromosome 16p within six pseudogenes (PKD1P1 to PKD1P6), which share a 97.7% sequence identity with the genuine gene (Bogdanova et al. 2001;Rossetti et al. 2012). This, together with a high GC content, the presence of many missense variants, the absence of mutation hot spots, and the high allelic heterogeneity of ADPKD, makes the molecular diagnostics of ADPKD challenging. In addition, most mutations are private variants, with a total of 1272 pathogenic PKD1 and 202 pathogenic PKD2 mutations reported to date (March 2014, ADPKD Database [PKDB], http://pkdb.mayo.edu). Thus, genetic diagnosis by conventional techniques of a new ADPKD family requires longrange polymerase chain reaction (LR-PCR) of the repeated region of PKD1 followed by nested PCRs (Rossetti et al. 2002), combined with Sanger sequencing of all 46 PKD1 and 15 PKD2 exons. When pathogenic mutations are not identified by Sanger sequencing, multiplex ligationdependent probe amplification (MLPA) analysis is also performed to identify potential insertions and deletions.
Therefore, there is a demand for more simple and costeffective molecular approaches that could be used for routine diagnosis, especially now with the coming specific therapies that will require differential genetic diagnosis (Torres and Harris 2006). To address these challenges, we have developed and validated an assay that couples genome partitioning and next-generation sequencing (NGS), to comprehensively perform in one-step mutation screening in PKD1 and PKD2, as an alternative to cumbersome conventional genetic testing methods.

Material and Methods
Subjects High-quality genomic DNA from 53 unrelated patients was obtained from blood lymphocytes, using standard protocols. The validation cohort included 36 ADPKD patients and five control individuals that had previously undergone conventional genetic diagnosis by Sanger sequencing of all PKD1 and PKD2 exons and, if negative, MLPA was also applied. The discovery cohort consisted of 12 ADPKD consecutive patients received for genetic diagnosis for which no mutations were known. ADPKD diagnosis was based on standard clinical and imaging criteria. Blood samples were obtained from other family members if they were available. All samples were codified and bioinformatics mutation analysis was blindly performed. Signed informed consent was obtained for all participants. This study was approved by the institutional review board.
Capture and multiplexed sequencing of the PKD1 and PKD2 genes To carry out DNA capture, we designed a custom Nim-bleGen SeqCap EZ Choice Library (Roche, Inc., Madison, WI) to target the complete genomic sequence of the PKD1 and PKD2 genes, and 1 kb of genomic sequence flanking at the 5 0 and 3 0 ends of each gene, accounting for 121,322 bp. Our design also included probes to target additional genes related to other human inherited diseases, for a total of 2.1 Mb of captured DNA after removal of repetitive sequences. DNA probes were selected using the most stringent settings for probe design (uniqueness tested by Sequence Search and Alignment by Hashing Algorithm [SSAHA]) (Ning et al. 2001). However, in order to be able to generate capture probes for the duplicated PKD1 regions, we altered the settings for probe design of this specific region to allow probes to have up to 10 close matches in the genome. No probe redundancy was allowed in the final capture design for the rest of target regions. The Browser Extensible Data file of captured regions is available on request to the authors.
Libraries were prepared with the TruSeq DNA Sample Preparation Kits (Illumina, Inc., San Diego, CA). Genomic capture from pooled libraries was carried out using NimbleGen SeqCap EZ Library (Roche, Inc.) following User's Guide v3.0 instructions, as previously described (Trujillano et al. 2013). The libraries of the patients of the validation cohort and the five controls were prepared and sequenced together with seven samples of other diseases using the same capture design and enrichment pro-tocol in two pools of 24 samples, for a total of 48 samples multiplexed in two HiSeq 2000 (Illumina, Inc.) lanes to generate 2 9 100 bp paired-end reads. The 12 patients of the discovery cohort were enriched in a single capture reaction and were sequenced in a Miseq (Illumina, Inc.) run to generate 2 9 250 bp paired-end reads.

Bioinformatics analysis and mutation identification and classification
The resulting fastq files were analyzed with an in-house developed pipeline previously described (Trujillano et al. 2013). All the bioinformatics tools used in this study were run using default settings unless stated otherwise. For the patients included in this study, only the sequencing data produced for PKD1 and PKD2 were analyzed, as stated in the signed informed consent. The reference sequences used were NM_001009944.2 for PKD1 and NM_000297.2 for PKD2. In order to identify pathogenic mutations that could cause ADPKD, we applied the following cascade of filtering steps (Walsh et al. 2010): 1 We required all candidate variants on both sequenced DNA strands and to account for ≥20% of total reads at that site in order to filter out spurious variant calls caused by misaligned reads in the duplicated region of PKD1. 2 Common polymorphisms (≥5% in the general population) were discarded by comparison with dbSNP 137, the 1000G, the Exome Variant Server (http://evs.gs. washington.edu), and an in-house exome variant database to filter out both common benign variants and recurrent artifact variant calls, especially in the duplicated PKD1 regions. However, since these databases also contain known disease-associated mutations, all detected variants were compared to gene mutation databases (The Human Gene Mutation Database [HGMD], www.hgmd.cf.ac.uk and ADPKD Database [PKDB], http://pkdb.mayo.edu). 3 Mutations that could give rise to premature truncated proteins, that is, stop mutations, exonic deletions/insertions, and large genomic rearrangements were classified as definitely pathogenic. 4 Missense and noncanonical splicing variants were considered a priori Unclassified Sequence Variants (UCV) and their potential pathogenicity was evaluated using an in silico scoring system developed for PKD1 and PKD2 genes as previously described (Rossetti et al. 2007). This scoring system takes into consideration a number of in silico predictors (Grantham 1974;Tavtigian et al. 2006;Rossetti et al. 2007) and population data. We scored each of these factors, the sum of which resulted in an overall Variant Score (VS). The UCV were classified into four groups: highly likely pathogenic (VS ≥ 11); likely pathogenic (5 ≤ VS ≤ 10), indeterminate (0 ≤ VS ≤ 4), and highly likely neutral (VS ≤ À1) (Rossetti et al. 2007).
We considered to be pathogenic mutations those sequence variants predicted to result in a truncated protein (classified as definitely pathogenic) and those not found in healthy controls, that segregated with the disease in families and expected to severely alter the protein sequence using in silico predictors (classified as highly likely pathogenic and likely pathogenic variants).
If no pathogenic mutations were identified, the bioinformatics pipeline automatically reported the target sequences that presented low or inexistent sequence coverage. These regions were screened by Sanger sequencing since they were more likely to contain the pathogenic variants missed by our NGS approach. Validation of newly identified single-nucleotide variants (SNVs) was performed by Sanger sequencing.

PKD1 and PKD2 enrichment
Eighty-one percent of the PKD1and PKD2-targeted bases could be covered with capture baits for a final targeted region of 98,524 bp divided into 99 individual regions, with lengths ranging from 65 to 6,493 bp (average of 995 bp) (Table S1). Noteworthy, 100% of all coding sequences, that is, the complete 46 and 15 exons of PKD1 and PKD2, respectively, were covered by capture baits. The target regions that precluded bait tilling correspond only to intronic and intergenic sequences.

Sequencing statistics
In the validation cohort, an evenly distributed mean depth of coverage of 331X and 481X for PKD1 and PKD2 was achieved, respectively, on average across samples (Table 1). We achieved a sequencing depth of 289X for the 46 exons of PKD1 and 453X for the 15 exons of PKD2, on average across samples (Table S2). Ninety-five percent of the coding base pairs of PKD1 and 94% of PKD2 were covered by more than 20 reads, which is enough for an accurate detection of known and novel mutations. Only exons 1 and 42 of PKD1 and exon 1 of PKD2 were not captured and sequenced at an adequate read depth (Fig. 1).
Due to the lower throughput of the MiSeq sequencer, the average coverage achieved in the discovery cohort was of 81X and 174X for PKD1 and PKD2, respectively, across the 12 samples (Table 1). For a comprehensive summary of the obtained sequencing results, see also Tables S3, S4.

Detection of PKD1 and PKD2 mutations in the validation cohort
For the validation cohort we selected samples with as many different types of PKD1 and PKD2 mutations as possible, including SNVs, short insertions and deletions (InDels), and large structural variants (SVs). We identified 35 out of 36 previously known different pathogenic mutations (30 in PKD1 and five in PKD2) in their correct heterozygous state (Table 2). These results would have led to a diagnostic rate of 97.2%. Noteworthy, 25 (70%) of these mutations were spread along different exons within the segmentally duplicated regions of the PKD1 gene, highlighting the robustness of our approach even for genes with highly homologous pseudogenes. In addition, two previously known large deletions were correctly detected. Concretely, patient 03-106-P6 presented PKD1 g.2154344-2186386del ( Fig. 2A), and patient 11-571-P2 presented PKD2 g.88952828-89050618del (Fig. 2B). For the unique patient with a previously known mutation not identified by our NGS assay, 03-393-P3, manual inspection of the sequence alignment files revealed that the missing p.(Met1fs) was localized in a region of exon 1 of PKD1 for which no NGS reads were available due to problems with the genomic capture. We included in this study five control individuals without personal or family history of ADPKD to determine the clinical specificity of our assay. These controls had been previously genotyped with a HumanOmni 2.5-8 BeadChip (Illumina, Inc.) and were also used to determine the analytic sensitivity of our assay to detect heterozygous and homozygous SNVs. Genotype data were available for a total of 80 and 269 sites within the targeted regions of PKD1 and PKD2, respectively. Sensitivity was of 100% both for PKD1 (20/20) and PKD2 (103/103). Analytic specificity was 100% both for PKD1 (60/60) and PKD2 (166/166) (Table S5). Of note, no spurious pathogenic calls were detected in either the control individuals or the validation cohort.

Identification of PKD1 and PKD2 mutations in the discovery cohort
We detected pathogenic mutations in 10 out of 12 patients carrying a total of 11 different pathogenic mutations (10 in PKD1 and one in PKD2), which lead to a diagnostic rate of 83.3%. All variants were confirmed by Sanger sequencing (Table 3). Interestingly, one patient (12-444) harbored one definitively pathogenic mutation in PKD2 and one highly likely pathogenic mutation in PKD1, presenting a more severe phenotype compared to the rest of the family. For the two samples in which no pathogenic variants were identified with our NGS assay, the bioinformatics pipeline proposed a list of candidate regions with low sequence coverage that were screened by Sanger sequencing and the two causal mutations were identified. Then, by manually inspecting the alignment files of the NGS reads, we realized that we had lost p.(Val2768Met) in patient 13-102 and p.(Arg4021fs) in patient 07-335 because their locations were in poorly covered areas of PKD1 and the variant calling and filtering algorithms had discarded them as potential false positives calls. Noteworthy, no spurious pathogenic calls were reported in any of the samples of the discovery cohort.

Discussion
It has been suggested that the impact that NGS technologies will have on clinical genetics during the upcoming years will be comparable to the introduction of X-rays to medicine many decades ago (Hennekam and Biesecker 2012). After the tremendous impact of NGS technologies in the discovery of disease-causing genes during the last 4 years, we are witnessing the introduction of these technologies for diagnostic applications, with the aim of rapidly revolutionize the field of genetic diagnostics, making it much more cost-and time-effective, advance accuracy, and point to unsuspected yet treatable conditions. The purpose of this study was to develop a cost-effective method for the molecular diagnostics of ADPKD applying targeted NGS. First, we validated the assay in a cohort of 36 previously characterized ADPKD patients in which we detected 35 out of 36 known mutations. Second, we analyzed a discovery cohort of uncharacterized ADPKD patients and we reached a diagnostic rate of 83% (10 out of 12 patients), allowing test reporting 5 days after receiving the DNA samples. Although the size of our cohort is modest, these results are very encouraging since these numbers represent a diagnostic rate comparable to data obtained by Sanger sequencing (Audrezet et al. 2012;Cornec-Le Gall et al. 2013) and NGS (Rossetti et al. 2012).
Recently, targeted sequencing by NGS has been used in the identification of mutations in ADPKD. Rossetti et al. (2012) did not apply a capture protocol for PKD1 and PKD2 enrichment since they speculated that the duplicated genomic regions of PKD1 would lead to concurrent capture of the six PKD1 pseudogenes making very difficult the identification of the ADPKD causal variants. Instead, these authors used a strategy of pooling equimolar LR-PCR amplicons and multiplexing barcoded libraries. Their approach showed a high sensitivity, specificity and accuracy, but it is a very laborious task more amenable to characterize large ADPKD populations than for routine genetic diagnosis. Moreover, their approach did not allow detecting large genomic rearrangements. Here, we do not only demonstrate that genome enrichment by in-solution hybridization using an elaborated probe design is an accurate strategy for mutation identification in the duplicated regions and the rest of PKD1 and PKD2, but also that this strategy is ready to substitute LR-PCR-based methods in the routine genetic diagnostics of ADPKD to detect all sorts of sequence variants, including SVs.
When we conceived this study, we assumed that it would be extremely difficult to specifically capture the genuine PKD1, that is, there would always be residual enrichment of the six pseudogenes. Therefore, instead of excluding this region from our assay we decided to include unspecific probes to the duplicated region of PKD1 in our capture library. In this regard, the mutation calls are on average lower than the reference calls, most likely due to the pseudogenes background (Tables 2, 3). From our point of view, the critical point of the assay was not the presence of sequencing reads coming from both the genuine PKD1 and its pseudogenes. Instead, the major challenge was to map the reads coming from duplicated regions unambiguously to the genuine PKD1 or to the six pseudogenes.
In order to minimize the impact of sequence reads coming from the pseudogenes we allowed mapping to the whole genome, instead of restricting the mapping to the targeted region. Moreover, the length of the millions of overlapping sequencing reads produced in this study (2 9 100 bp and 2 9 250 bp in the validation and discovery cohorts, respectively) in combination with the 300 bp insert sizes in the DNA libraries provided enough sequence specificity for accurate mapping and pseudogene discrimination, allowing us to unambiguously map a large proportion of the sequencing reads to PKD1 (Table 1). Furthermore, we also assume that the alignment algorithm is not 100% reliable and some reads coming from the pseudogenes could have been erroneously aligned to PKD1. In the worst scenario, the accumulation of these misaligned reads could lead to spurious variant calls but, as we have observed, none of these potential false positive variant calls passed the stringent filters of our variant prioritization pipeline in any of the patients of the validation and discovery cohorts neither in the five control samples.
The low sequencing coverage obtained for exons 1 and 42 of PKD1 and exon 1 of PKD2, likely due to a high GC content, is the main limitation of this study as all variants that were missed were located in these poorly covered regions highlighting the importance of achieving sufficient depth of coverage for the optimal performance of the assay. However, we think that this can be fixed by rebalancing and adding new and replicate probes hybridizing with these poorly covered regions in the capture design. In the discovery cohort, the low average depth of coverage yielded by the MiSeq (Illumina, Inc.) in some samples was the cause of the lower mutation detection rate of the assay. However, we plan in the future to produce an optimized capture design including only PKD1/PKD2 and a few cystic genes that would help in the differential diagnosis, such as HNF1B (17p12; OMIM ID: 189907) and PKHD1 (6p12.3-p12.2; OMIM ID: 606702). This would significantly reduce the total captured DNA per sample, allowing multiplexing more samples per MiSeq (Illumina, Inc.) run, and to achieve higher depths of coverage (comparable to those obtained for the validation cohort) that will allow more confident variant calling.
We estimate that with our NGS-based assay a 60% of cost savings per sample could be achieved, and the whole diagnostics process could be a minimum of five times faster than with the conventional techniques currently used for the genetic diagnostics of ADPKD. In addition, our strategy offers a complete definition of the captured genes, without the need for stepwise testing anymore and having to choose which gene to sequence first, and is capable to detect large genomic rearrangements and deep intronic variants. In the discovery cohort, the complete process of library preparation, genomic enrichment, NGS using a MiSeq (Illumina, Inc.), and bioinformatics analysis was completed in 5 days after reception of the DNA samples.
In conclusion, we illustrate here the first successful study using in-solution hybridization enrichment coupled to NGS to detect ADPKD pathogenic mutations, both in the duplicated regions of PKD1 and the rest of PKD1 and PKD2 genes. Our approach is cost-and time-effective, and meets the sensitivity and specificity criteria required for genetic diagnostics, providing NGS experimental and bioinformatics approaches ready to substitute classic molecular tools in routine genetic diagnostics of ADPKD.

Supporting Information
Additional Supporting Information may be found in the online version of this article: Table S1. Target regions for capture of the PKD1 and PKD2 genes. Table S2. Average coverage statistics of PKD1 and PKD2 exons in the validation cohort. Table S3. By sample sequencing of quality control and coverage statistics in the validation cohort. Table S4. By sample sequencing quality control and coverage statistics in the discovery cohort in sequencing of PKD1 and PKD2. Table S5. NGS of PKD1 and PKD2 versus genotyping calls in the five control samples.