We explored an approach to detect disease-causing sequence variants in 448 candidate genes from five index cases of autosomal dominant retinitis pigmentosa (adRP) by DNA sequence capture and next-generation DNA sequencing (NGS). Sequence variants were detected by NimbleGen sequence capture followed by NGS on a SOLiD platform. After filtering out variants previously reported in genomic databases, novel potential adRP-causing variants were validated by dideoxy capillary electrophoresis (Sanger) sequencing and co-segregation in the families. A total of 55 novel sequence variants in the coding or splicing regions of adRP candidate genes were detected, 49 of which were confirmed by Sanger sequencing. Segregation of these variants in the corresponding adRP families showed three variants present in all the RP-affected members of the family. A novel mutation, p.L270R in IMPDH1, was found to be disease causing in one family. In another family, a variant, p.M96T in the NRL gene, was detected; this variant was previously reported as probably causing adRP. However, the p.A76V mutation in NRL, previously reported as a cause of RP, was excluded by co-segregation in the family. We discuss the benefits and limitations of our approach in the context of mutation detection in adRP patients.
Retinitis pigmentosa (RP) is the most common form of inherited retinopathy [1-3], affecting more than 1.5 million people worldwide. RP displays all three modes of Mendelian inheritance: autosomal dominant retinitis pigmentosa (adRP), autosomal recessive retinitis pigmentosa (arRP), and X-linked retinitis pigmentosa (XLRP), as well as a digenic and mitochondrial mode of inheritance [5, 6]. Mutations in nearly 20 genes have been associated with adRP [6-8]. In the last two decades, screening for mutations in candidate genes associated with adRP in individual patient samples has been carried out in different populations. Different methods, such as single-strand conformational polymorphism (SSCP), denaturing gradient gel electrophoresis (DGGE) or denaturing high-performance liquid chromatography (DHPLC) followed by direct genomic sequencing or, more recently, mutation arrays, have been used in surveys of mutations in adRP patients. In current clinical practice, sequencing of candidate genes involved in a disease in individual patient samples is becoming increasingly important in order to carry out genetic diagnostics.
The introduction of next-generation DNA sequencing (NGS) technology is becoming increasingly necessary for sequencing genes to characterize mutations causing a monogenic disease [10-12]. We have recently developed a cost-effective method (submitted for publication) to analyze for mutations in 12 common genes (accounting for over 95% of known mutations) associated with adRP, in a scalable Roche 454 GS Junior (Roche, Applied Science, Barcelona, Spain) benchtop sequencing platform that is feasible for the sequencing of a subset of genes in individual samples using the NGS technique.
In some Western populations almost 50% of molecularly diagnosed adRP patients are carriers of a known mutation in a candidate gene [7, 8, 13]. However, 50% of the patients analyzed did not show any causative mutation in known adRP-associated genes, suggesting a genomic alteration in a gene as yet uncharacterized. Thus, an important task is to characterize mutations in new genes that may cause adRP. Massively parallel sequencing of several candidate genes has been used in patients previously excluded for mutations in known candidate genes for adRP. This study analyzed 21 index adRP patients for mutations in 40 candidate genes by massive generation of polymerase chain reaction (PCR) amplicons and analysis by two different NGS platforms, and is the first to show the potential use of NGS in the detection of mutations in novel genes associated with adRP.
Although the complete genomic sequencing of one individual would be the most powerful approach to find a disease-causing mutation, the cost and efforts required still make it impractical for routine use in disease gene research. In the case of a candidate gene approach, the sequencing target can be made much smaller than the whole genome. For this purpose, the amplicon approach which uses specific primer pairs for multiplex target amplification by PCR may be effective for targeting a few tens of genes, but going beyond this scale requires more effort to design primers and optimize PCR reactions. To overcome the cost and effort of primer design, one major approach is array-based capture which relies on the ability to create a custom in situ synthesized oligonucleotide microarray for use as a collection of hybridization capture probes. This technology selects targeted sequences by hybridization to an oligonucleotide microarray and shows great potential for the efficient enrichment of specific large high-complexity genomic regions of interest [14, 15]. Recently, two different approaches using specific DNA hybridization and NGS in RP have been reported [16, 17].
We used DNA capture and NGS in a group of five adRP index patients. Our approach consisted of the analysis of candidate adRP genes rather than a complete exome. Thus, we first selected as candidate genes those with specific major expression in the retina. However, we also included genes involved in pre-mRNA splicing, because mutations in some of these genes have been associated with adRP [18-22], although they are widely expressed. In a previous study, we characterized a set of genes with differential expression in lymphocytes from controls and from RP patients with a mutation in the pre-mRNA splicing factor PRPF8. These genes were also included in the array as candidate genes.
Materials and methods
Patients and samples
Written informed consent was obtained from all the patients prior to the study, which was conducted in accordance with the Declaration of Helsinki and approved by the internal Clinical Research Ethics Committee (CEIC) of the Hospital de Terrassa, Spain.
We selected five index cases of Spanish adRP families who had previously been studied and diagnosed clinically with adRP. Thus, we included families with RP transmission from male to male, or with identical symptoms and penetrance in males and females, or after excluding X-chromosome linked RP by linkage analysis. Diagnosis of RP was carried out by ophthalmic examination, which consisted of a visual field, visual acuity, dark-adapted sensitivity and electroretinographic analysis, in accordance with previously established protocols (ISCEV). Genomic DNA was prepared from peripheral blood lymphocytes of the five index adRP patients using the QIAamp DNA Blood Mini kit (Izasa, Barcelona, Spain).
Selection of candidate genes
We designed a sequence-capture experiment with a 385K array which has a sequence capacity of 5 Mb. The array capacity used allows inclusion of only a limited number of genes. Consequently, we first selected a number of candidate genes possibly involved in adRP. The genes selected for DNA capture (Table S1) included autosomal genes that would be relevant in terms of specific expression or function in the retina.
An initial set of candidate genes was retrieved from the NCBI Gene database (http://www.ncbi.nlm.nih.gov/gene) by text search of adRP-related terms (retina, retinitis, eye, cone, rod, photoreceptor, macula, macular, ocular, blindness, visual perception, visual system, visual cycle, retbindin, Rho, Usher, opsin and fovea) in the gene/protein name field and the summary text. To find novel disease-associated eye genes, we also added to the candidate gene list those genes related to retinal dystrophies that were previously reported in the eye disease genes database of the NEIBank website (http://neibank.nei.nih.gov/cgi-bin/eyeDiseaseGenes.cgi). Most of the genes related to syndromic forms of RP are unlikely to be involved in the cause of disease in the index patients or affected relatives, as these patients had no other apparent pathology apart from dominant RP. Moreover, none of the known variants found in genes associated with syndromic RP has yet been associated with dominant RP. Consequently, although mutations in the same gene may produce different phenotypes, we decided not to include most genes associated with syndromic RP in the candidate list of adRP genes.
Our list included 15 genes associated with adRP even if they had been partially or fully screened in the patients studied. Also included in the candidate list were those genes previously associated with arRP because some mutations in these genes may be inherited in a dominant mode, as in the case of NR2E3 [25, 26]. Genes with major or specific expression in the retina included those showing a restricted eye pattern in the NCBI Unigene database (http://www.ncbi.nlm.nih.gov/unigene) or being predominant in the NEI retina-specific library (http://neibank.nei.nih.gov/cgi-bin/showDataTablecgi?lib=NbLib0042); and predicted targets of retina-specific transcription factors CRX [27, 28], NRL and NR2E3 [30, 31].
Mutations in some genes that are involved in pre-mRNA splicing, although expressed ubiquitously, can cause adRP. Candidate genes involved in pre-mRNA splicing include those encoding the U4/U6.U5 tri-snRNP and associated Sm/LSm proteins, and functional partners, predicted by STRING (http://string-db.org), of the adRP-related PRPF8, PRPF31, PRPF3, RP9 and ASCC3L1. In addition, genes that showed differential expression or splicing (not published) in previous arrays comparing wild-type and PRPF8 mutant lymphocytes from RP patients were also included. Finally, genes associated with autosomal dominant macular dystrophy, such as PRPH2 (peripherin/RDS) [33-37], or with cone and cone-rod dystrophies, such as GUCY2D, GUCA1, PDE6H or RIM1 [38-41], were also included in the collection (Table S1).
DNA capturing using NimbleGen array
A total of 448 selected genes (Table S1) were converted to chromosomal and exon coordinates using the UCSC human genome assembly hg18 (http://genome.ucsc.edu) and submitted to NimbleGen (Roche NimbleGen, Barcelona, Spain) to generate a 385K array to capture coding and flanking regions. For the array design, 385,000 unique probes 60–80 nucleotides in length across the coding and flanking sequences of 448 target genes were designed. Repetitive regions were not covered, overlapping target regions were merged into one, and regions were extended to 250 bp to increase capture efficiency. Uniqueness of probes was assessed with Sequence Search and Alignment by Hashing Algorithm (SSAHA) and an additional padding of 100 bases (offset) was added to both sides of probes in order to obtain an additional coverage. The final approved custom array covered 95% of the total 1,938,644 bases corresponding to 5397 targets.
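The target preparation described above (merging overlapping regions and extending short ones) can be illustrated with a minimal Python sketch. It assumes half-open (start, end) coordinates on a single chromosome and reads "extended to 250 bp" as a minimum region length; the function names and example coordinates are hypothetical and do not reproduce the NimbleGen design software.

```python
def merge_overlapping(regions):
    """Merge overlapping or touching (start, end) intervals on one chromosome."""
    merged = []
    for start, end in sorted(regions):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def extend_to_min_length(region, min_len=250):
    """Symmetrically extend a region that is shorter than min_len."""
    start, end = region
    missing = min_len - (end - start)
    if missing <= 0:
        return region
    new_start = max(0, start - missing // 2)
    return (new_start, new_start + min_len)


def prepare_targets(exons, min_len=250):
    """Merge exons, extend short regions, then merge again in case
    the extension created new overlaps."""
    extended = [extend_to_min_length(r, min_len) for r in merge_overlapping(exons)]
    return merge_overlapping(extended)


# Hypothetical exon coordinates, for illustration only.
targets = prepare_targets([(1000, 1100), (1050, 1200), (5000, 5400)])
print(targets)
```

The second merge step is a design choice of this sketch: extending two nearby short regions can make them overlap, so they are merged again before probe design.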
DNA capture was performed in accordance with NimbleGen sequence capture protocols (NimbleGen Arrays User's Guide: Sequence Capture Array Delivery v3.2). Briefly, approximately 20 µg of genomic DNA of the index patient from each family was used in the preparation of DNA for sequence-capture hybridization. DNA was fragmented to a size range of 300–500 base pairs (bp) using a GS Nebulizer Kit (Roche, Barcelona, Spain). The fragmented DNA was purified and analyzed on a Bioanalyzer 2100 DNA Chip 7500 (Agilent, Barcelona, Spain). Fragment ends were polished with the use of T4 DNA polymerase and T4 polynucleotide kinase, and adapters were ligated onto the polished ends with T4 DNA ligase. Small fragments (<100 bp) were removed with the use of AMPure Beads (Izasa, Barcelona, Spain). The resulting library (5 µg) was hybridized to a custom 385K array with the use of the NimbleGen Sequence Capture Hybridization System of the NimbleGen System 4. The hybridized target DNA was washed and eluted with the use of a NimbleGen Wash and Elution Kit according to the manufacturer's instructions. The eluted sample was amplified by ligation-mediated PCR with the use of primers complementary to the sequence of the adaptors and purified according to the instruction manual (NimbleGen v3.2). This PCR was performed for 20 cycles using a thermostable DNA polymerase blend (Expand High FidelityPLUS PCR System; Roche, Applied Science, Barcelona, Spain) to minimize sequence errors.
The average loci fold enrichment was estimated in each sequence capture sample by performing quantitative PCR on internal quality control (QC) loci (NimbleGen v3.2) before and after sequence capture enrichment and then calculating the relative changes in template concentration for those loci.
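As an illustration of this calculation, the following Python sketch converts quantitative PCR threshold cycles (Ct) measured before and after capture into a fold enrichment per QC locus. It assumes the standard relation fold = E^(Ct_before - Ct_after), with amplification efficiency E = 2 and equal template input in both reactions; the exact formula of the NimbleGen protocol is not given here, and the Ct values are hypothetical.

```python
def fold_enrichment(ct_before, ct_after, efficiency=2.0):
    """Relative change in template concentration for one QC locus.

    A lower Ct after capture means more template, hence enrichment > 1.
    """
    return efficiency ** (ct_before - ct_after)


def mean_fold_enrichment(loci, efficiency=2.0):
    """Average fold enrichment over (ct_before, ct_after) pairs."""
    folds = [fold_enrichment(before, after, efficiency) for before, after in loci]
    return sum(folds) / len(folds)


# Hypothetical Ct values for two QC loci (before, after capture).
qc_loci = [(28.0, 20.0), (27.0, 18.0)]
print(mean_fold_enrichment(qc_loci))  # (2**8 + 2**9) / 2 = 384.0
```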
Next-generation sequencing and bioinformatics analysis
A total of 10 µg of target-enriched DNA from NimbleGen capture arrays was used to generate the libraries for SOLiD NGS according to the protocols of Life Technologies (Madrid, Spain). Briefly, DNA amount (Qubit fluorometer, Life Technologies, Madrid, Spain) and quality (Nanodrop, Thermo Fisher, and Bioanalyzer, Agilent, Barcelona, Spain) were checked, and the DNA was examined by gel electrophoresis. Libraries were prepared according to the protocols of Life Technologies for sequencing with SOLiD v4. The quality of the libraries was assessed with Qubit and the average size was determined using Bioanalyzer. Emulsion PCR to obtain microspheres for sequencing was also carried out according to Life Technologies protocols for SOLiD v4 sequencing.
Bioinformatics analysis pipeline
First, low-quality reads (more than 25 nt with a quality value below 9) were removed from the initial dataset. Then, colorspace reads were aligned against the human reference genome (NCBI37/hg19) using Bioscope (http://solidsoftwaretools.com) with default settings. Single nucleotide variants (SNVs) were called with Samtools with the following filtering criteria: SNV quality of 20 (Phred-like score), genotype quality of 30 (Phred-like value) and minimum coverage of 9×. Insertion/deletion (Indel) calling was performed with the Small Indel tool, part of the Bioscope analysis suite. A single filter of at least nine non-redundant reads was applied for indel detection. For prioritization, identified variants were annotated and classified according to their position or effect on transcripts using in-house Perl scripts to query the Ensembl database (release 59), which includes data from dbSNP, the 1000 Genomes project and the HapMap project along with other sources. In addition, the SIFT program was used to identify potentially protein-damaging mutations. Among SNVs, we distinguish single nucleotide polymorphisms (SNPs), nucleotide changes with a frequency >0.01 in a population, from rare variants, which are reported in a database with a frequency <0.01 in a population. We use the term point mutation (mutation) in adRP for a reported disease-causing genetic variant (SNV or Indel). Novel variants that resulted in a non-synonymous coding, a frameshift coding or a non-frameshift coding change on transcripts and variants located within splice sites were considered in order to identify novel candidate genes.
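The SNV filtering and frequency classification described in this section can be sketched in Python as follows. This is an illustrative reimplementation, not the Samtools or in-house Perl code used in the study: the `SnvCall` record and its field names are hypothetical, the thresholds are read as minimum values, and a frequency of exactly 0.01 is treated as rare.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SnvCall:
    snv_quality: float                 # Phred-like SNV quality
    genotype_quality: float            # Phred-like genotype quality
    depth: int                         # read coverage at the site
    population_frequency: Optional[float] = None  # None: not in any database


def passes_filters(call, min_snv_q=20, min_gt_q=30, min_depth=9):
    """Apply the three thresholds quoted in the text as minimum values."""
    return (call.snv_quality >= min_snv_q
            and call.genotype_quality >= min_gt_q
            and call.depth >= min_depth)


def classify_by_frequency(call, threshold=0.01):
    """Label a call as 'novel', 'rare' or 'SNP' from its database frequency."""
    if call.population_frequency is None:
        return "novel"
    return "SNP" if call.population_frequency > threshold else "rare"


# Hypothetical calls, for illustration only.
calls = [
    SnvCall(45, 60, 30),                              # unreported variant
    SnvCall(50, 55, 12, population_frequency=0.004),  # rare variant
    SnvCall(60, 70, 40, population_frequency=0.20),   # common SNP
    SnvCall(15, 40, 25),                              # fails SNV quality
]
kept = [c for c in calls if passes_filters(c)]
print([classify_by_frequency(c) for c in kept])  # ['novel', 'rare', 'SNP']
```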
To screen for the mutations c.809T>G (p.L270R) in IMPDH1 and c.-14delC in PDE6G in RP patients and controls, we designed specific oligonucleotide primers and fluorescence resonance energy transfer (FRET) probe pairs targeting each variant. Primers and probes (Table S2) were synthesized by TIB MOLBIOL (Berlin, Germany). The probes are separated by a single nucleotide, which allows a strong FRET signal to occur. The donor probe was labeled with fluorescein at its 3′ end, and the acceptor probe was labeled with LightCycler Red 640 (LC Red 640) at its 5′ end. Real-time PCR amplification was performed using the LightCycler 480 system (Roche, Barcelona, Spain). Data were acquired and analyzed with the Melting Curve Genotyping software.
Sample preparation and library construction
Genomic DNA from each of the five index adRP patients was used in the NimbleGen Sequence Capture method. DNA was processed on a custom-tailored 385K NimbleGen array containing the probes to capture the coding and flanking sequences of 448 genes previously selected as candidates for adRP. We obtained an average yield of 15 µg of captured DNA per sample. The average capture enrichment for QC loci, calculated using the NimbleGen internal controls, ranged from 229- to 1053-fold. These target-enriched samples were used for construction of the sequencing library (Table S3).
Sequencing data and sequence variants
The percentage of mapped filtered samples should be >50%. The specificity of the DNA capture was measured by the percentage of ‘on target’ or ‘near target’ mapped sequences. Thus, between 27% and 34% of total mapped sequences were on target while the near target specificity was 43–53% (considering a 100-bp interval on both sides) (Table S4).
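The specificity measure can be made concrete with a short Python sketch that classifies mapped reads against target intervals. It assumes half-open coordinates on a single chromosome and counts a read as near target when it overlaps a target padded by 100 bp on each side, so that near-target reads include on-target reads; this cumulative definition is an assumption, and the coordinates are hypothetical.

```python
def overlaps(read, target, flank=0):
    """True if the read overlaps the target extended by `flank` bp per side."""
    read_start, read_end = read
    target_start, target_end = target
    return read_start < target_end + flank and read_end > target_start - flank


def capture_specificity(reads, targets, flank=100):
    """Return (on-target %, near-target %) of mapped reads."""
    if not reads:
        raise ValueError("no mapped reads")
    on = sum(any(overlaps(r, t) for t in targets) for r in reads)
    near = sum(any(overlaps(r, t, flank) for t in targets) for r in reads)
    return 100.0 * on / len(reads), 100.0 * near / len(reads)


# Hypothetical target and reads, for illustration only.
targets = [(1000, 1200)]
reads = [(1150, 1200), (1220, 1270), (5000, 5050), (900, 950)]
print(capture_specificity(reads, targets))  # (25.0, 75.0)
```

For real data, the linear scan over targets would be replaced by an interval index, but the definition of the two percentages stays the same.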
The sensitivity of the DNA capture was measured by the percentage of bases contained in the array that were covered in the sequencing. The average capture percentage in all samples was 98–99% and 93–95% assuming 1× and 20× coverage, respectively (Fig. S1).
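A minimal Python sketch of this sensitivity measure is given below. It assumes that a per-base depth value is available for every targeted base (for example, exported from a depth report); the depth values shown are hypothetical.

```python
def breadth_of_coverage(depths, min_depth):
    """Percentage of targeted bases covered by at least `min_depth` reads."""
    if not depths:
        raise ValueError("no targeted bases")
    covered = sum(1 for d in depths if d >= min_depth)
    return 100.0 * covered / len(depths)


# Hypothetical per-base depths over ten targeted bases.
depths = [0, 5, 20, 35, 50, 19, 22, 100, 1, 40]
print(breadth_of_coverage(depths, 1))   # 90.0
print(breadth_of_coverage(depths, 20))  # 60.0
```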
The sequence variants detected were associated with their position in the gene transcript, annotated, and classified according to the nomenclature of the Ensembl database. Variants were classified into two groups: SNVs and Indels. The sequence variants (SNVs and small Indels) detected in all samples were compared with the variants reported in the database. We obtained a total of 4730 SNVs, 200 of which were unreported in the database. For Indels, we found a total of 222 with 124 unreported (Table 1).
Table 1. Sequence variants detected by NGS in samples from five adRP index cases
Figure 1 shows the workflow of the analysis and validation of the genomic variants found. We used the Ensembl v59 database to analyze the sequence variants found according to the functional consequence in the target transcript. Previously reported variants detected in coding and flanking regions were annotated (Table S6–S9). As a candidate variant causing adRP, we first selected the unreported SNVs that generate a non-synonymous change or affect a splicing site. We also analyzed Indels found in coding or splicing regions of the target genes, and annotated the novel variants found in the analysis of the five index adRP patients. A further analysis showed that 17 of these sequence variants were previously annotated in a database but with a frequency <0.01 (Table 2).
Table 2. Novel and rare variants (<0.01 frequency) detected in five adRP index cases
n/a, not available.
Considered ‘Damaging’ if ≤0.05 and ‘Tolerant’ if >0.05.
Genes previously associated with recessive retinal dystrophies.
Uncharacterized protein C2orf71
EH domain-binding protein 1
Interphotoreceptor matrix proteoglycan 2
G protein-coupled receptor kinase 7
IMP (inosine 5′-monophosphate) dehydrogenase 1
Myosin III A
Spectrin, beta, non-erythrocytic 5
WD repeat and SOCS box-containing 1
Hydroxysteroid (17-beta) dehydrogenase 14
Solute carrier family 1 (glutamate transporter), member 7
Uncharacterized protein C16orf92
Uncharacterized protein C2orf71
Myosin VIIA and Rab interacting protein
Protein phosphatase, EF-hand calcium binding domain 2
Protein phosphatase, EF-hand calcium binding domain 2
Uncharacterized protein C2orf71
FERM and PDZ domain containing 1
TEA domain family member 4
Neural retina leucine zipper
Solute carrier family 1 (glutamate transporter), member 7
To validate the 55 novel sequence variants found, we carried out direct Sanger sequencing of the variants. Six (11%) of these sequence variants were not confirmed by Sanger sequencing and were considered to be false positives. Moreover, 22 sequence variants that had been characterized in a previous survey by Sanger sequencing in some adRP genes of the index patient samples were all detected by NGS. These data point out the high specificity and sensitivity of our approach.
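The validation figures quoted above can be checked with a few lines of Python. The sketch uses only the counts given in the text; the function names are illustrative, and sensitivity is computed on the set of 22 previously characterized variants only.

```python
def percentage(part, whole):
    """Part of a whole expressed as a percentage."""
    return 100.0 * part / whole


novel_candidates = 55       # novel variants called by NGS
not_confirmed = 6           # not reproduced by Sanger sequencing
known_variants = 22         # previously characterized by Sanger sequencing
known_detected_by_ngs = 22  # of those, detected by NGS

confirmed = novel_candidates - not_confirmed
false_positive_pct = percentage(not_confirmed, novel_candidates)
confirmation_pct = percentage(confirmed, novel_candidates)
sensitivity_pct = percentage(known_detected_by_ngs, known_variants)

print(confirmed)                     # 49
print(round(false_positive_pct, 1))  # 10.9
print(round(confirmation_pct, 1))    # 89.1
print(sensitivity_pct)               # 100.0
```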
We also analyzed the sequence variants found in each sample that were previously annotated in the Ensembl database. Sequence variants were filtered and those causing a non-synonymous, splicing, or premature stop change for SNVs, and frame or in-frame change for Indels were selected. We annotated from these variants those reported in the database with a frequency of 0.01 or lower (rare variants), but none of them proved to be disease causing according to family segregation. Other sequence variants, reported in the database with a frequency >0.01, proved to be SNPs unrelated to adRP.
Segregation of genetic variants
The 49 confirmed variants were evaluated for their pathogenic potential by three prediction tools (Table 2). Some of these variants were predicted to cause adRP. We examined the co-segregation of these 49 variants in the probands' families. A variant was considered not to be directly causing adRP if it was absent in an RP patient in the family. Using this criterion, novel mutations in the NRL, IMPDH1 and PDE6G genes were each found in a family (Figs 2, 3 and 4). The other variants did not segregate with RP in the families (not shown). The p.A76V mutation in NRL was detected in the index case of family 65 (Fig. 2a). This variant, although not annotated in the database, had been previously reported as a mutation of uncertain significance as a cause of RP. However, in family 65, one RP patient (IV-2, Fig. 2a) does not carry the p.A76V mutation while unaffected members of the family proved to be carriers of the mutation. Consequently, the p.A76V substitution is unlikely to be causing adRP in this family.
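The exclusion criterion used here can be written as a short Python sketch. A variant is excluded when any affected relative lacks it; unaffected carriers are reported separately, since they may reflect incomplete penetrance rather than exclusion. The `Member` record and the toy pedigree are hypothetical and are not taken from the families studied.

```python
from dataclasses import dataclass


@dataclass
class Member:
    member_id: str
    affected: bool
    carrier: bool


def cosegregation(members):
    """Summarize how a variant segregates with disease in one family."""
    affected_noncarriers = [m.member_id for m in members
                            if m.affected and not m.carrier]
    unaffected_carriers = [m.member_id for m in members
                           if m.carrier and not m.affected]
    return {
        "excluded": bool(affected_noncarriers),
        "affected_noncarriers": affected_noncarriers,
        "unaffected_carriers": unaffected_carriers,
    }


# Toy pedigree: one affected relative does not carry the variant.
family = [
    Member("II-1", affected=True, carrier=True),
    Member("III-2", affected=True, carrier=False),
    Member("III-3", affected=False, carrier=True),
    Member("III-4", affected=False, carrier=False),
]
result = cosegregation(family)
print(result["excluded"])  # True
```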
Family 645 was included in this study because an adRP mutation survey in a Spanish population detected the novel variant p.M96T in NRL in RP patients of this family. However, segregation of this variant showed two carriers of the mutation (Fig. 2b) who so far remain asymptomatic, suggesting an incomplete penetrance for an NRL mutation. Thus, we searched for an alternative disease-causing mutation in this family. None of the detected sequence variants (Table 2) was potentially adRP causing in this family.
Family 93 had already been screened for mutations by DGGE (including IMPDH1) in a previous adRP survey without detecting any disease-causing mutation. However, in this study we detected a novel variant, p.L270R in IMPDH1, that was confirmed by Sanger sequencing (Fig. 3b). The variant is carried by all eight available patients and absent in the unaffected members analyzed (Fig. 3a). This p.L270R mutation in IMPDH1 was not detected in 100 adRP index patients or in 150 controls screened by real-time PCR using FRET probes (Fig. 3c).
Segregation of the novel and rare sequence variants detected in family 95 showed that only the c.-14delC variant in the PDE6G gene is carried by all the patients of the family, including an asymptomatic obligate carrier and two asymptomatic members (Fig. 4a). This nucleotide deletion is located at position −14 of the PDE6G gene, in a promoter region conserved in primates. We used real-time PCR with a pair of FRET probes to screen for this variant in 150 controls, detecting 15.2% heterozygous and 1.6% homozygous individuals in our population (Fig. 4c). To date, only one disease-causing mutation in PDE6G has been reported in an arRP family. This mutation, c.187+1G>T in the conserved intron 3 donor splicing site, was reported as causing RP in homozygous carriers in a large family in which the heterozygous carriers are unaffected.
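The genotype percentages reported for the controls imply an allele frequency for c.-14delC well above the 0.01 threshold used in this study to separate polymorphisms from rare variants. The short Python sketch below shows the arithmetic: each homozygote contributes two copies of the allele and each heterozygote one. The function name is illustrative.

```python
def allele_frequency(het_fraction, hom_fraction):
    """Variant allele frequency from genotype fractions in a diploid sample."""
    return hom_fraction + het_fraction / 2.0


freq = allele_frequency(het_fraction=0.152, hom_fraction=0.016)
print(round(freq, 3))  # 0.092
print(freq > 0.01)     # True: behaves as a polymorphism, not a rare variant
```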
Family 83 showed a complex trait with consanguinity in one branch of the family. Interestingly, in the index patient we detected a homozygous novel variant p.S1225_E1226insS in the C2orf71 gene. Variants in this gene have been associated with arRP [47, 48]. However, segregation of the p.S1225_E1226insS variant in the family detected unaffected members who were homozygous and heterozygous carriers (Fig. 5).
Although we limited our analysis to 448 candidate genes, a considerable number (4730) of genetic variants were identified in the samples from the five patients analyzed. After filtering our data we still found 55 novel sequence variants, 6 of which proved to be false positives after Sanger sequencing. Thus, in the five samples assayed an average of 10 novel sequence variants in RP candidate genes was obtained. Prediction of a pathogenic effect of these variants performed with the PolyPhen or SIFT algorithms showed several potentially disease-causing variants per sample. These predictions are based on the putative loss of protein function. However, in a dominant disease such as adRP, a pathogenic mechanism of gain of function or dominant-negative effect may occur rather than a haploinsufficient mechanism caused by a loss of function. Consequently, prediction of a pathogenic effect of genetic variants by PolyPhen or SIFT is of limited value in adRP. Moreover, segregation of variants in the families excluded most of these novel candidate variants as a cause of adRP. This clearly limits the use of this approach to large families with several RP-affected members and excludes an effective analysis for adRP in isolated cases.
In our analysis, we detected three variants that could potentially be disease causing. In family 93, the novel p.L270R mutation in the IMPDH1 gene co-segregated with the disease and it is likely to be the cause of RP in this family. The mutation p.M96T in the NRL gene in family 645 was detected in our study. We had already reported it and, as mentioned, this NRL mutation could cause adRP with incomplete penetrance in this family. Although none of the other sequence variants that were found segregated with RP in this family, the possibility cannot be ruled out that another variant outside the candidate genes analyzed could be the cause of RP in this family.
In the index case of family 65 we found the mutation p.A76V in NRL, which was previously considered to be possibly RP disease causing. This mutation was absent in one RP-affected member and present in unaffected members of this family. Consequently, the variant p.A76V seems not to be causing adRP in this family.
In families 95 and 83, we detected in PDE6G and C2orf71, respectively, a variant that is present in all the affected members of the family. Mutations in those genes had been previously associated with arRP [46-49]. However, the variants detected in PDE6G and C2orf71 were found in homozygous unaffected carriers or in controls, suggesting that neither of these variants is disease causing.
NGS analysis confirmed the large number of genetic variants in an individual. Our results show the large number of polymorphisms and rare variants in candidate genes that may be involved in retinal function. Although only two disease-causing genetic variants of adRP were found in our samples, our analysis revealed multiple variants that are predicted as potentially pathogenic. The effect of these genetic variants on the structure and function of the retina is largely unknown. This existing variability in retinal function genes could explain the high heterogeneity in the clinical expression of RP, even in patients from the same family. Data obtained with NGS analysis could be a potential source for phenotype/genotype correlations. Our results revealed the possible disease-causing mutation in two adRP families despite the high number of retinal candidate genes analyzed, showing the genetic complexity of adRP. Thus, exome sequencing in adRP samples should be the method of choice to increase the chance of detecting a disease-causing genetic variant.
Our results suggest that, although NGS may reveal large numbers (potentially all) of existing genetic variants in an individual, characterization of novel disease-causing variants needs to be confirmed, e.g. by family segregation. Thus, the real bottleneck of a complete NGS exome analysis is the segregation analysis of the putative hundreds of variants it predicts. On the other hand, the presumed decreasing costs of sequence capture and NGS technologies make feasible the future sequencing (NGS) of all affected and unaffected members of a family rather than just the index case. Comparison of the data obtained should facilitate the characterization of novel disease-causing variants in large families.
We thank Tim Corson for assistance and advice in editing this manuscript and Ian Johnstone for revising the manuscript. This work was partially supported by grants from the Fondo de Investigaciones Sanitarias (FIS PS09/01271, PI09/90754) and the RIRAAF cooperative research network (RETICS) RD07/0064/2005.