A specific diagnosis can markedly improve the psychological and economic well-being of patients and families [Graungaard and Skov, 2007]. In addition, therapeutic interventions, including reproductive counseling, are increasingly dependent on an accurate molecular diagnosis. The specific molecular basis of disease is often difficult to determine for disorders with extensive genetic heterogeneity. Therefore, improvements in the cost, speed, and effectiveness of molecular-diagnostic techniques provide important clinical benefits and are especially needed in this setting.
Two examples of genetically heterogeneous disorders are the hereditary spastic paraplegias (SPGs) [Stevanin et al., 2008] and the limb girdle muscular dystrophies (LGMD) [Guglieri et al., 2008]. Current techniques for nonmolecular diagnosis rely on clinical examination, electromyography, nerve conduction studies, MRI, and biopsy [Mercuri, 2010; Norwood et al., 2007], that is, studies that are either invasive or limited in sensitivity and specificity [Charlton et al., 2009]. Obtaining precise molecular diagnoses in the SPGs and muscular dystrophies is known to be problematic [England et al., 2009; Smith et al., 1993]. In particular, using Sanger sequencing to screen individual exons for mutations is cost prohibitive.
In an effort to reduce diagnostic costs, a variety of alternative approaches to mutation screening have been explored, including homozygosity mapping [Papic et al., 2011], protein arrays [Escher et al., 2010], and gene chips (http://www.nmd-chip.eu/). Exome sequencing (ES) is efficient and economical compared with the other listed methods. It surveys a broad list of potential regions (nearly all exons) and is increasingly inexpensive. Furthermore, it is not limited to a small set of targeted mutations (e.g., gene chips) or a specific population history (e.g., homozygosity mapping techniques) [Ng et al., 2009]. Unlike targeted capture [Brownstein et al., 2011; Shen et al., 2011] or polymerase chain reaction of subsets of genes [Shearer et al., 2010], it is not limited to known disease-related genes. Variation detection using next generation sequencing offers high sensitivity [Harismendy et al., 2009]. While different next generation sequencing methods are available, the various methods of exome capture have shown similar accuracy of genotype calling [Asan et al., 2011].
Based upon these factors as well as economic considerations [Carlson, 2010], we first examined whether ES might be appropriate for diagnosis in the setting of clinically suspected SPG, LGMD, or other muscular dystrophies and myopathies. Second, we considered how to construct a clinical report that would accurately reflect the information yielded by the ES assay.
We applied ES to diagnosis in two projects. The first was a single family for whom LGMD had been diagnosed clinically, but no causative mutations had been found by previous testing. The second was a cohort of 125 individuals (from 34 families) with a variety of unknown diagnoses. The cohort was used to assess sequencing coverage for two sets of genes: one associated with SPG and one associated with a selection of other muscular diseases (MDs). We present detailed analysis of our findings and demonstrate that ES has the potential to address the requisites of clinicians and diagnostic laboratories as a first-tier molecular test for genetically heterogeneous neuromuscular disorders.
Materials and Methods
Clinical Presentation of the LGMD-Diagnosed Family
The LGMD proposita (II-1) is a 19-year-old woman of healthy unrelated Croatian parents who came from a small town and had no prior family history of MD. The proposita and her sister developed a progressive limb-girdle muscular dystrophy with onset at 13 and 11 years, respectively. A biopsy of II-1's deltoid muscle at age 17 years showed a dystrophic process without an inflammatory component. Specific features included multiple internalized nuclei, endomysial fibrosis, occasional regenerating muscle fibers, fiber atrophy, and grouping of type I and type II myofibers. Immunohistochemistry and immunofluorescence did not detect abnormalities of dystroglycans (α and β), sarcoglycans (α, β, δ, and γ), dystrophin (fractions 1, 2, and 3), merosin, or calpain 3.
Patients enrolled in the study gave informed consent for protocol H09-01228 (Vancouver, BC, Canada), approved by the University of British Columbia Research Ethics Board, or for protocol 76-HG-0238 (National Institutes of Health, Bethesda, MD), approved by the Institutional Review Board of the National Human Genome Research Institute. ES data used for the coverage-assessment analyses were obtained from 125 individuals enrolled in the NIH Undiagnosed Diseases Program (UDP). The 125 individuals, from 34 unrelated families, included 54 founders. Intrafamilial and interfamilial correlations (Pearson correlation coefficient) of sequencing coverage were not significantly different, and subsequent coverage analyses treated the samples as independent.
Exome Capture, Massively Parallel Sequencing, and Alignment
ES was performed using genomic DNA extracted from peripheral blood. In-solution exome capture was performed according to the manufacturer's protocol using SureSelect Human All Exon Kits (Agilent Technologies, Santa Clara, CA) or the TruSeq Exome Enrichment Kit (Illumina, San Diego, CA). The Agilent SureSelect Human All Exon Kit (38Mb) was used for the proposita and 78 UDP samples, the Agilent SureSelect Human All Exon 50Mb Kit was used for 36 UDP samples, and the Illumina TruSeq kit was used for 11 UDP samples. Data for the 125-exome cohort were grouped according to the exome capture kit used. The set of 78 exomes captured with the Agilent SureSelect Human All Exon Kit (38Mb) will be referred to as E-38; the 36 exomes captured with the Agilent SureSelect Human All Exon 50Mb Kit will be referred to as E-50; the 11 exomes captured with the Illumina TruSeq Exome Enrichment Kit will be referred to as E-62. Flow-cell preparation and 76 to 100-bp (base pair) paired-end (PE) read sequencing were performed per the protocol for the Illumina Genome AnalyzerIIx (Illumina, San Diego, CA).
For the LGMD proposita, the short-read PE sequences were aligned to the human genome reference sequence (NCBI build 36; hg18) and the alignments were represented in the BAM file format. Sequence variants were called using the NextGENe software (Softgenetics, Pennsylvania). NCBI dbSNP data (dbSNP128) was used within the NextGENe software for polymorphic variant analysis.
For the 125 cohort exomes, alignment to the human genome reference sequence (UCSC assembly hg18, NCBI build 36) was carried out using the Efficient Local Alignment of Nucleotide Data algorithm (Eland, Illumina, Inc., San Diego, CA). Eland was used in such a way that PE reads were aligned independently, and those that aligned uniquely were grouped into genomic sequence intervals of about 100 kb. Reads that failed to align were binned with their PE mates without Eland using the PE information. Reads that mapped equally well in more than one location were discarded. Crossmatch, a Smith–Waterman-based local alignment algorithm, was used to align binned reads to their respective 100 kb genomic sequence, using the parameters −minscore 21 and −masklevel 0 (http://www.phrap.org) [Wei et al., 2011]. Genotypes were called using a Bayesian genotype caller, most probable genotype (MPG) [Teer et al., 2010].
Sanger Capillary Sequencing
Sanger sequencing was performed according to standard protocols (see Supp. Methods) on exon 4 of the CAPN3 gene in the proband, and on the single coding exon of the FKRP gene and the last coding exon of the TPM3 gene on 12 randomly selected control DNA samples.
Exome Sequencing Coverage Analysis
Genes of interest causing MD and SPG (Supp. Tables S1 and S2) were selected from published articles (PubMed search queries: SPG; muscular dystrophy; myopathy), the Washington University Neuromuscular Disease Center (http://neuromuscular.wustl.edu) and the NCBI Online Mendelian Inheritance in Man (OMIM, www.ncbi.nlm.nih.gov/omim). Supp. Tables S1 and S2 list the selected genes and neuromuscular phenotype, as well as allelic phenotypes. Genes causing isolated cardiomyopathy without reported skeletal muscle involvement or metabolic disease associated with myopathy were not included. These criteria identified 64 genes for muscular disease (MD dataset) and 24 genes for spastic paraplegia (SPG dataset), 88 genes in all. Each of the 88 genes was annotated within the University of California, Santa Cruz (UCSC) Known Genes database. Not every known disease-causing gene is annotated in the Consensus Coding Sequence (CCDS) database [Pruitt et al., 2009]. Of the 64 MD and 24 SPG genes, 59 and 23, respectively, are annotated in the CCDS.
All exon coverage analysis
For overall and comprehensive gene coverage analysis, UCSC-annotated genes (NCBI build 36; hg18) were downloaded from the UCSC table browser. Corresponding exons, for all transcripts of the 88 genes, were extracted using the UCSC genome browser in BED file format and a collapsed, unique, merged-exon BED file was generated to account for all exons in a given gene locus. These unique-exon BED files were used to query the ES data for coverage at 496,922 (47–91% targeted) and 102,707 (53–92% targeted) nucleotide bases of MD and SPG genes, respectively (Table 1).
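The collapsing of overlapping exons from multiple transcripts into a unique, merged-exon set can be illustrated with a minimal Python sketch. This is not the authors' script: it assumes BED-style 0-based, half-open intervals supplied as `(chrom, start, end)` tuples, and the function names `merge_exons` and `total_bases` are hypothetical.

```python
from collections import defaultdict


def merge_exons(intervals):
    """Collapse overlapping or adjacent intervals into unique merged intervals.

    intervals: iterable of (chrom, start, end), BED-style 0-based half-open.
    Returns a list of merged (chrom, start, end) tuples sorted by chrom, start.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))

    merged = []
    for chrom in sorted(by_chrom):
        current_start = current_end = None
        for start, end in sorted(by_chrom[chrom]):
            if current_start is None:
                current_start, current_end = start, end
            elif start <= current_end:
                # Overlapping or touching exon: extend the current interval.
                current_end = max(current_end, end)
            else:
                merged.append((chrom, current_start, current_end))
                current_start, current_end = start, end
        if current_start is not None:
            merged.append((chrom, current_start, current_end))
    return merged


def total_bases(intervals):
    """Number of nucleotide bases spanned by non-overlapping intervals."""
    return sum(end - start for _, start, end in intervals)


# Illustrative exons from two transcripts of one hypothetical gene locus.
exons = [("chr1", 100, 200), ("chr1", 150, 250), ("chr1", 300, 400), ("chr2", 10, 20)]
unique_exons = merge_exons(exons)
print(unique_exons, total_bases(unique_exons))
```

Counting bases only after merging ensures that a nucleotide shared by several transcripts contributes once to totals such as the 496,922 MD and 102,707 SPG bases reported above.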
Table 1. Analysis of 125 Exomes for Read Depth within the UCSC-Annotated Exons of MD and Hereditary SPG Genes
a) The exons of the 64 UCSC muscle disease genes studied span 496,922 nucleotide bases. b) The exons of the 24 UCSC SPG genes studied span 102,707 bases. c) 232,954 (46.88%) MD nucleotides are included in the E-38 kit, d) 361,419 (72.73%) in the E-50 kit, and e) 449,870 (90.53%) in the E-62 kit. f) 54,819 (53.37%) SPG nucleotides are included in the E-38 kit, g) 58,198 (56.66%) in the E-50 kit, and h) 94,384 (91.90%) in the E-62 kit. Subsequent coverage uses the number of nucleotides in the respective exon enrichment kit as denominator. i) Range (min. to max.) of the median sequence coverage per exome within each dataset. j) Range of the mean (average) sequence coverage per exome within each dataset. k) Range of the percent of bases that had coverage ≥20X in each exome; average in parentheses. Range of the proportion, across all exomes, of bases with l) 0X, m) <20X, and n) ≥20X; averages in parentheses. o) Analysis per base position refers to the analysis of each individual MD or SPG position across every exome; that is, for the E-38 muscle disease gene dataset, 3% of bases were identified as consistently having <20X in all 78 samples (p), 61% consistently had ≥20X in all 78 samples (q), and 36% of bases had variable coverage (<20X in some exomes and ≥20X in others). For the MD and SPG nucleotides targeted by the E-38 capture, none had null coverage in every exome. For the E-50 kit, only 2 of the 361,419 targeted MD nucleotides (r) and 2 of the 58,198 targeted SPG nucleotides (s) consistently had null coverage in every exome tested. For the E-62 kit, 1097 MD bases had null coverage across all exomes (t); 333 SPG bases had null coverage across all exomes (u).
Exon enrichment kit (# samples)
Exonic bases targeted per platform
Analysis per exome
Median coverage per exome (i)
Average coverage per exome (j)
% of targeted bases with ≥20X per exome (k)
Analysis across samples
Range % bases w/o sequence (l) (average)
Range % bases with <20X (m) (average)
Range % bases with ≥20X (n) (average)
Analysis per base position (o)
Bases with <20X in all exomes (p)
Bases with ≥20X in all exomes (q)
Bases with ≥20X in at least 1 exome
Bases w/o sequence in all exomes
Bases with ≥1X in all exomes
Analysis was restricted to exons as most reported disease-causing mutations reside in these regions [Cooper et al., 2011]. Because not all the bases in the exons (e.g., untranslated regions (UTRs)) were targeted for capture by the commercial kits, subsequent analyses focused on the intersection of regions targeted by each in-solution capture kit and the unique-exon BED files (Agilent Technologies, Santa Clara, CA; Illumina, San Diego, CA). The intersections were extracted from the unique-exon BED files using the online tool Galaxy [Blankenberg et al., 2010; Goecks et al., 2010]. We refer to the intersection of the unique-exon BED file and targeted regions as MD-UE and SPG-UE for the MD and SPG genes, respectively.
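The intersection of merged exons with kit-targeted regions was performed in Galaxy; the following Python sketch only illustrates the underlying interval logic and is not the tool's implementation. It assumes that both inputs are BED-style 0-based, half-open `(chrom, start, end)` tuples and that each input has already been merged so that its intervals do not overlap one another. The name `intersect_intervals` is hypothetical.

```python
def intersect_intervals(exons, targets):
    """Return the regions shared by two sets of non-overlapping intervals."""

    def group(intervals):
        grouped = {}
        for chrom, start, end in intervals:
            grouped.setdefault(chrom, []).append((start, end))
        for chrom in grouped:
            grouped[chrom].sort()
        return grouped

    exon_map, target_map = group(exons), group(targets)
    shared = []
    for chrom in sorted(exon_map):
        a = exon_map[chrom]
        b = target_map.get(chrom, [])
        i = j = 0
        # Two-pointer sweep over the sorted, non-overlapping interval lists.
        while i < len(a) and j < len(b):
            start = max(a[i][0], b[j][0])
            end = min(a[i][1], b[j][1])
            if start < end:
                shared.append((chrom, start, end))
            # Advance whichever interval finishes first.
            if a[i][1] < b[j][1]:
                i += 1
            else:
                j += 1
    return shared


# Illustrative coordinates only.
unique_exons = [("chr1", 100, 250), ("chr1", 300, 400)]
kit_targets = [("chr1", 120, 220), ("chr1", 240, 320), ("chr2", 0, 50)]
targeted = intersect_intervals(unique_exons, kit_targets)
exon_bases = sum(end - start for _, start, end in unique_exons)
targeted_bases = sum(end - start for _, start, end in targeted)
print(targeted, 100.0 * targeted_bases / exon_bases)
```

The ratio of intersected bases to all unique-exon bases corresponds to the "% targeted" figures quoted for each kit.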
CCDS coding bases coverage analysis
To analyze the ES data in the context of well-curated protein-coding regions (CDS), the NCBI Consensus CDS (CCDS) annotations (NCBI build 36; hg18) [Pruitt et al., 2009] for 82 genes (59 MD genes; 23 SPG genes) were downloaded from the UCSC table browser, and coding bases from these genes were extracted into BED files. Six genes from the original gene list were not annotated by CCDS (Supp. Tables S1 and S2) and, therefore, these genes were not considered in our subsequent analyses of well-curated coding bases. Overlapping coding regions of multiple transcripts for a given gene were collapsed into a single unique BED file. We refer to these unique coding regions of the CCDS as MD-CDS and SPG-CDS for MD and SPG genes, respectively. In total, 214,384 (97–98% targeted) and 53,957 (98–99% targeted) coding bases of MD and SPG genes, respectively, were interrogated to evaluate the performance of ES on the high-quality protein-coding fraction of our gene panel (Table 2).
Table 2. Percentage of Protein-Coding Bases (CCDS) of 82 MD and SPG Genes Targeted by Three Different Exome-Capture Kits (E-38, E-50, and E-62)
a) Range of the percentage of coding bases that were sequenced with a coverage of ≥1X per exome of each dataset.
b) Range of the percentage of coding bases that were sequenced with a coverage of ≥20X per exome of each dataset.
Exon enrichment kit (# samples)
Coding bases targeted per platform (%)
Analysis per exome
% of coding bases with ≥1X/exome (a)
% of coding bases with ≥20X/exome (b)
Exome Sequence Coverage of Known Variant Sites
Genotyping accuracy of next generation sequencing has been assessed by concordance with SNP arrays for publicly available SNP positions [Craig et al., 2008; Harismendy et al., 2009]; therefore, we also assessed ES coverage at captured nucleotides corresponding to the sequence variants deposited in NCBI dbSNP130 (ftp://ftp.ncbi.nih.gov/snp/). We constructed two BED files containing all SNPs within the exons of the 64 MD genes and 24 SPG genes, respectively. For assessment of coverage at pathogenic mutations deposited in the Human Gene Mutation Database (HGMD Professional release 2010.2, Biobase, Wolfenbüttel, Germany) [Stenson et al., 2009], we constructed two BED files containing positions for all exonic mutations of the 64 MD and 24 SPG genes [Cooper et al., 2011]. The respective files were intersected with the targeted regions (as reported by the capture kit manufacturer) to obtain all exonic SNP or mutation positions targeted by each exon enrichment kit. Subsequent calculations were restricted to these targeted exonic mutation positions for each kit. For SNPs and mutations spanning greater than one nucleotide base, and/or for overlapping SNPs and mutations, each single nucleotide base was queried for coverage independently and only once to avoid redundancy.
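The rule that each nucleotide is queried once, even when SNPs or mutations span several bases or overlap, can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' code: variants are given as BED-style 0-based, half-open `(chrom, start, end)` tuples, and `unique_positions` is a hypothetical name.

```python
def unique_positions(variants):
    """Expand variant intervals into a sorted list of unique single-base positions."""
    positions = set()
    for chrom, start, end in variants:
        for pos in range(start, end):
            # A set keeps each (chrom, pos) once, however many variants cover it.
            positions.add((chrom, pos))
    return sorted(positions)


# Illustrative input: a single-base SNP, a 3-base deletion overlapping it,
# and a second SNP inside the deletion.
variants = [("chr1", 10, 11), ("chr1", 10, 13), ("chr1", 12, 13)]
print(unique_positions(variants))
```

Here three reported variants reduce to three distinct base positions instead of five base queries, which is the redundancy the text describes avoiding.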
For Illumina GAIIx 76–100-bp PE read sequencing, bases with at least 1X (1 fold) coverage are considered amenable to being sequenced, and those with a single base coverage of ≥20X (20 fold) are amenable to being genotyped [Bentley et al., 2008; Craig et al., 2008; Harismendy et al., 2009]. We, therefore, established two single base coverage thresholds for our analyses: 1X as a proxy for ability to sequence, and 20X as a minimum for confident genotype calling.
Nucleotide coverage queries and read depth analyses were performed using SAMtools [Li et al., 2009]. The depth of coverage was queried from BAM files with SAMtools and subsequent data manipulation was performed with custom scripting using the Python (v2.7.1) and Perl (v5.12.1) programming languages. Descriptive statistics for sequencing data analysis were obtained using SPSS software (version 19.0; SPSS Inc., Chicago, IL) and R statistical computing software (version 2.11.1) (Supp. Tables S3 and S4).
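As a hedged illustration of the kind of post-processing described here, the sketch below summarizes per-exome coverage from `samtools depth` output, which for a single BAM file consists of tab-separated chromosome, 1-based position, and depth. It is written in present-day Python 3 rather than the Python 2.7.1 used in the study, it is not the authors' script, and it assumes that targeted positions missing from the output have zero depth. The function name `summarize_exome` and the example values are hypothetical.

```python
from statistics import mean, median


def summarize_exome(depth_lines, targeted_positions,
                    sequence_threshold=1, genotype_threshold=20):
    """Summarize read depth over targeted positions for one exome.

    depth_lines: iterable of 'chrom<TAB>pos<TAB>depth' strings.
    targeted_positions: non-empty list of (chrom, pos) tuples, 1-based.
    """
    depth = {}
    for line in depth_lines:
        chrom, pos, value = line.rstrip("\n").split("\t")[:3]
        depth[(chrom, int(pos))] = int(value)

    # Assumption: a targeted position absent from the output has depth 0.
    depths = [depth.get(position, 0) for position in targeted_positions]
    n = len(depths)
    return {
        "median": median(depths),
        "mean": mean(depths),
        "pct_ge_1x": 100.0 * sum(d >= sequence_threshold for d in depths) / n,
        "pct_ge_20x": 100.0 * sum(d >= genotype_threshold for d in depths) / n,
    }


lines = ["chr1\t101\t25", "chr1\t102\t5", "chr1\t103\t40"]
targets = [("chr1", 101), ("chr1", 102), ("chr1", 103), ("chr1", 104)]
print(summarize_exome(lines, targets))
```

The two default thresholds mirror the 1X (sequenced) and 20X (genotyped) cut-offs defined in the preceding paragraph.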
Exome Sequencing Detects a CAPN3 Mutation in Family 1
Comparison of the LGMD proposita's exome sequence to the human reference sequence (NCBI Build 36; hg18) and to the polymorphism database dbSNP128 identified 11,369 sequence variants unreported as polymorphisms (SNPs). Given that the parents came from the same small town, we looked for nonsynonymous homozygous mutations potentially inherited from a common ancestor. Among the 302 putative homozygous variants that encode an amino acid alteration was NG_008660.1:c.550delA, a single nucleotide deletion encoding the frameshift mutation p.Thr184ArgfsX36 in CAPN3 (NP_000061.1) (Fig. 1A). This mutation, which was validated by Sanger dideoxy nucleotide chain-termination sequencing and segregated with disease in the family, is a previously described cause of LGMD2A in the Croatian population (Fig. 1B) [Canki-Klain et al., 2004].
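The filtering strategy described above (variants absent from dbSNP, homozygous, and predicted to alter the amino acid sequence) can be sketched in Python. The record layout, the field names `dbsnp_id`, `zygosity`, and `effect`, and the set of effect labels are all hypothetical; NextGENe's actual output format is not reproduced here.

```python
# Hypothetical effect labels treated as altering the encoded protein.
PROTEIN_ALTERING = {"missense", "nonsense", "frameshift", "inframe_indel"}


def candidate_variants(variants):
    """Keep variants that are unreported in dbSNP, homozygous, and protein altering."""
    return [
        v for v in variants
        if v["dbsnp_id"] is None
        and v["zygosity"] == "homozygous"
        and v["effect"] in PROTEIN_ALTERING
    ]


# Illustrative records; only the CAPN3 change is taken from the text.
variants = [
    {"gene": "CAPN3", "change": "c.550delA", "dbsnp_id": None,
     "zygosity": "homozygous", "effect": "frameshift"},
    {"gene": "GENE_A", "change": "example_1", "dbsnp_id": "rs_example",
     "zygosity": "homozygous", "effect": "missense"},
    {"gene": "GENE_B", "change": "example_2", "dbsnp_id": None,
     "zygosity": "heterozygous", "effect": "missense"},
]
print([v["gene"] for v in candidate_variants(variants)])
```

Restricting to homozygous changes reflects the working assumption of a shared ancestral allele; a different inheritance model would need a different filter.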
Exome Sequencing Profiles Most MD and SPG-Associated Nucleotides
Given the speed and ease with which we identified the disease-causing mutation in this family, where conventional immunohistological diagnostic techniques were falsely negative for a calpain-3 defect, we hypothesized that an ES approach could be generally useful in the clinical diagnosis of genetically heterogeneous neuromuscular disorders such as hereditary MDs or SPGs (Supp. Tables S1 and S2). To test this hypothesis, we analyzed ES results from 34 unrelated families that included 125 individuals and defined the coverage for each nucleotide within the 2,532 exonic intervals of the 64 MD and 24 SPG genes. We used coverage as a proxy for the likelihood of detecting a coding mutation in a known disease gene. Based on the premises discussed in the methods, we assessed each nucleotide at two depths of sequencing coverage (≥1X and ≥20X).
The average overall coverage per exome was 94X in the 78 exomes obtained with the 38 Mb exon enrichment kit (E-38), 111X in the 36 exomes obtained using the 50 Mb exon enrichment kit (E-50) and 108X in the 11 exomes obtained using the TruSeq Exome Enrichment kit (E-62).
Focusing on the exonic bases of the selected 88 MD and SPG genes, qualitative inspection of the UCSC custom tracks (Supp. BedGraph Files S1, S2, and S3) depicting median coverage showed poor uniformity of coverage among samples and across exons. This was quantitatively summarized for each sample using descriptive statistical measures (Supp. Tables S3 and S4) and is readily observed in the box and whisker plots of each exome (Fig. 2).
For the exonic bases targeted by the E-38, E-50, and E-62 kits and within the 88 MD and SPG genes, up to 20% of positions were not consistently sequenced to the 20X level in every exome (Supp. Tables S5–S10). This variability arose from (1) low coverage at exon–intron boundaries that was consistent across the respective exomes in the E-38, E-50, and E-62 sets, and (2) low coverage within the exons. The latter could be separated into two groups: (a) regions of low coverage that were consistent across all samples, and (b) poorly covered single nucleotides or small stretches of nucleotides in several, but not all, exomes. Nucleotides flanking the poorly covered single nucleotides or small stretches of nucleotides had high coverage. Each source of variability was equally represented in the 54 independent founder exomes and in the additional 71 exomes.
To determine whether these areas of low coverage affected the utility of ES as a sole diagnostic test, we analyzed the coverage of each nucleotide in the selected regions of the 88 genes associated with MD and SPG. The distribution of coverage did not differ significantly between the exome sequences derived from the 54 founders and the exome sequences derived from all 125 exomes. Therefore, we present the data referring to the 125 exomes.
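The per-position analysis across exomes (positions consistently below 20X, consistently at or above 20X, variable, or without any reads in every exome) can be expressed as a short Python sketch. It assumes a hypothetical mapping from each targeted position to a non-empty list of depths, one per exome, and is offered as an illustration rather than as the scripts used in the study.

```python
def classify_positions(depth_by_position, threshold=20):
    """Count positions by how consistently they reach the coverage threshold.

    depth_by_position: dict mapping (chrom, pos) to a non-empty list of depths,
    one value per exome in the dataset.
    """
    summary = {
        "below_in_all": 0,        # <threshold in every exome
        "at_or_above_in_all": 0,  # >=threshold in every exome
        "variable": 0,            # mixed across exomes
        "null_in_all": 0,         # 0X in every exome (a subset of below_in_all)
    }
    for depths in depth_by_position.values():
        if all(d >= threshold for d in depths):
            summary["at_or_above_in_all"] += 1
        elif all(d < threshold for d in depths):
            summary["below_in_all"] += 1
        else:
            summary["variable"] += 1
        if all(d == 0 for d in depths):
            summary["null_in_all"] += 1
    return summary


# Illustrative depths for four positions in three exomes.
example = {
    ("chr1", 1): [30, 25, 40],
    ("chr1", 2): [5, 0, 10],
    ("chr1", 3): [0, 0, 0],
    ("chr1", 4): [10, 50, 22],
}
print(classify_positions(example))
```

Dividing each count by the number of targeted positions gives percentages of the kind reported in the "Analysis per base position" rows of Table 1.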
Exome Sequencing Provides ≥20X Coverage per Exome for 80–94% of Targeted Muscle Disease UCSC-Annotated Exonic Bases in 125 Exomes
The UCSC-annotated exonic sequences of the selected MD genes span 496.9kb. The E-38, E-50, and E-62 kits targeted 47, 73 and 91%, respectively, of these nucleotides (Table 1). Three MD genes (NEB, SGCD, and MYBPC3) were not targeted by enrichment kit E-38. For targeted nucleotides (MD-UE), the median coverage ranged from 45 to 164X for the 78 E-38 exomes, from 95 to 160X for the 36 E-50 exomes, and from 63 to 153X for the 11 E-62 exomes (Table 1; Fig. 2). For the MD-UE, the targeted regions with ≥20X coverage varied between an overall minimum of 80% and a maximum of 94%. As summarized in Table 1 and Figure 3, several MD-UE nucleotides had no or insufficient coverage in each exome for each of the three exon enrichment datasets. Also, the range of coverage for any given nucleotide varied significantly (Supp. Tables S5, S6, and S7). Nucleotides with null coverage tended to cluster within narrow regions that had poor coverage in multiple exomes, but no nucleotide had null coverage consistently in every E-38 exome and only 2 nucleotides had null coverage in every E-50 exome. However, 1079 nucleotides had no coverage in every E-62 exome; of these 1079 nucleotide positions, 98.2% mapped to exons or exon/intron boundaries of the TPM3, KIAA0613, FKRP, BIN1, PABPN1, and RYR1 genes.
Exome Sequencing Provides ≥20X Coverage per Exome for 77–95% of Targeted SPG UCSC-Annotated Exonic Bases in 125 Exomes
The UCSC-annotated exonic sequences of the hereditary SPG genes span 102.7kb. The E-38, E-50 and E-62 kits targeted 53, 57 and 92%, respectively, of these nucleotides (Table 1). For targeted nucleotides (SPG-UE), the median coverage ranged from 41 to 146X for the E-38 exomes, from 60 to 104X for the E-50 exomes and from 58 to 142X for the E-62 exomes (Table 1; Fig. 2). For the SPG-UE, the targeted regions with ≥ 20X coverage varied between an overall minimum of 77% and a maximum of 95%. Again, the range of coverage varied significantly (Supp. Tables S8, S9, and S10), and nucleotides with null coverage generally clustered within narrow regions that had poor coverage in several exomes. Also, similar to the MD-UE data, no nucleotide had null coverage in every E-38 exome and only 2 nucleotides had null coverage in every E-50 exome. In every E-62 exome, however, 333 nucleotides had no coverage (Table 1, Fig. 3), and 93.7% of these positions mapped to the start of the first exon of the GJC2, NIPA1, and SPG21 genes.
Exome Sequencing Provides ≥20X Coverage per Exome for 77–96% of All Nucleotides Within the Consensus Coding Sequence (CCDS) Bases
Given the low targeting (47–92%) of the UCSC-annotated exonic bases of the 88 MD and SPG genes, we narrowed our interrogation to the well-curated bases defined by the CCDS annotation (detailed coverage data per nucleotide provided in Supp. Tables S11 and S12). Of the 64 MD genes and 24 SPG genes, 59 and 23, respectively, are annotated in the CCDS database. The E-38, E-50 and E-62 kits targeted 97.6, 97.7, and 97.1% of MD and 98.3, 98.3, and 98.7% of SPG nucleotides, respectively (Table 2). For the MD and SPG genes for which all CCDS nucleotides were targeted, an average of 96.7 (207,859 bases) and 98.1% (52,928 bases), respectively, had ≥1X coverage and 85 (182,232 bases) and 86.8% (46,826 bases), respectively, had ≥20X coverage in all 125 exomes.
All three exome capture kits (E-38, E-50, and E-62) fully targeted the CCDS-annotated exons of 35 MD and 14 SPG genes (Supp. Fig. S1). These genes had 104X mean coverage across the 125 exomes. Across the 125 exomes, an average of 88.5–100% (median of averages: 99.8%) of nucleotides had ≥1X coverage for any given MD gene. Similarly, an average of 93.8–100% (median of averages: 99.9%) of nucleotides had ≥1X coverage for any given SPG gene. In the same dataset, the mean percentage of nucleotides with ≥20X coverage ranged from 19.1 to 99.5% (median of averages: 92.2%) for any given MD gene and from 61.2 to 99.5% (median: 94.6%) for any given SPG gene. Of note, the CCDS-annotated exons of four MD and two SPG genes had poor coverage in all 125 exomes; specifically, only 19, 57, 60, 61, 62, and 66% of the targeted nucleotides in FKRP, TCAP, LMNA, FA2H, DES, and PNPLA6, respectively, had ≥20X coverage (Fig. 4A and B) (detailed percentage coverage data per gene for ≥1, ≥10, and ≥20X coverage are provided in Supp. Tables S13–S18). To assess whether these results were improved by the latest ES technology (TruSeq capture and latest available next generation sequencing chemistry), we limited the analysis to the 11 samples tested with this ES kit and demonstrated improved coverage in comparison to the data from all 125 samples (Fig. 4C and D) for the percentage of nucleotides per gene with ≥20X coverage. We also assessed the same measure per gene for ≥1X coverage and ≥10X coverage (Supp. Fig. S2) to show how readily the samples targeted with the latest capture kit could be sequenced.
Exome Sequencing Provides ≥20X Coverage in Every Sample for 54–89% of Nucleotides with Reported Single Nucleotide Polymorphisms
To assess the implications of the variable coverage for detection of known sequence variants within the targeted exons of the MD and SPG genes, we analyzed sequence coverage for 3,821 single nucleotide polymorphism (SNP) positions deposited in dbSNP130 and mapping to UCSC-annotated exons (Table 3). Of the 3,282 SNP positions within the MD genes, 49, 72, and 93% were targeted by kits E-38, E-50, and E-62, respectively. Analysis of targeted SNP positions within the three datasets showed that 54 (E-38), 78 (E-50), and 87% (E-62) had ≥20X coverage in every exome, that 97, 93, and 97% had ≥20X coverage in at least one exome, that 94, 96, and 97% had ≥1X coverage in every sample of each dataset, and that 100, 100, and 99.9% had ≥1X coverage in at least one exome (Table 3, Supp. Fig. S3A). Similar results were observed when the analysis was limited to targeted bases within the CCDS-annotated exons.
Table 3. Analysis of 125 Exomes for Read Depth at Positions of SNPs Reported in dbSNP within the UCSC-Annotated Exons of MD and Hereditary SPG Genes
MD gene dataset
SPG gene dataset
Total exonic SNP positions in dbSNP 130
Abbreviations: %, percentage; MD, muscle disease; No., number; SPG, spastic paraplegia; w/o, without; X, fold coverage. a) Within the coding sequence of the 64 MD genes, 3,282 bases are referenced in dbSNP130. b) Within the coding sequence of the 24 spastic paraplegia genes, 539 bases are referenced in dbSNP130 as SNPs. c) 1,600 (48.75%) MD SNP positions are targeted by the E-38 exon enrichment kit, d) 2,376 (72.39%) by the E-50 kit, and e) 3,067 (93.45%) by the E-62 kit. f) 231 (42.85%) SPG SNP positions are targeted by the E-38 kit, g) 249 (46.20%) by the E-50 kit, and h) 493 (92.32%) by the E-62 kit. i) Range (minimum to maximum) of the median coverage per exome in each dataset for the targeted SNPs. Range (minimum to maximum) of SNP positions in each exome with coverage of j) 0X and k) <20X; average percentage in parentheses. l) Analysis per SNP position refers to the analysis of each individual SNP position across every exome; that is, for the E-38 MD gene dataset, 3% of SNP positions were identified as consistently having <20X in all 78 samples (m), 54% consistently had ≥20X in all 78 samples (n), and 43% of bases had variable coverage (<20X in some exomes and ≥20X in others). o) None of the SNP positions had null coverage in every E-38 or E-50 exome for either the MD or the SPG dataset; only 2 SNPs within the MD dataset had null coverage in every E-62 exome, and 4 SNPs within the SPG dataset.
Exon enrichment kit (# of samples)
Number of SNP positions targeted per kit
Median coverage per exome (i)
SNP position analysis across samples
Range SNP positions w/o sequence (j) (average %)
Range SNP positions with <20X (k) (average %)
Average % SNP positions with ≥20X
Analysis per SNP position (l)
SNP positions with <20X in all exomes (m)
SNP positions with ≥20X in all exomes (n)
SNP positions with ≥20X in at least 1 exome
SNP positions w/o sequence in all exomes (o)
SNP positions with ≥1X in all exomes
Of the 539 SNP positions reported within SPG gene coding sequences, 43, 46, and 92% were targeted by kits E-38, E-50, and E-62, respectively (Table 3). Similar to observations for the MD genes, analysis of targeted SPG SNP positions within the E-38, E-50 and E-62 datasets, respectively, showed that 55, 74, and 89% had ≥20X coverage in every exome, that 96, 91, and 96% had ≥20X coverage in at least one exome, that 94, 95, and 96% had ≥1X in every exome, and that 100, 100, and 99.8% had ≥1X coverage in at least one exome (Supp. Fig. S3B). Again, similar results were observed when the analysis was limited to targeted bases within the CCDS-annotated exons.
Exome Sequencing Provides ≥20X Coverage in All Exomes for 86–96% of Nucleotides Having a Reported Disease-Associated Mutation
To assess the implications of the variable coverage for detection of known pathogenic mutations within the targeted coding sequence of the MD and SPG genes, we also analyzed coverage at bases corresponding to mutations reported in the Human Gene Mutation Database (HGMD). To avoid overrepresentation bias, each position was considered only once, regardless of whether it was involved in more than one reported mutation.
Of the 3,570 mutation positions reported within the UCSC-annotated exons of the MD genes, the E-38, E-50, and E-62 kits, respectively, targeted 3,268 (92%), 3,542 (99%), and 3,543 (99%) (Table 4). Analysis of targeted mutation positions in the E-38, E-50, and E-62 datasets, respectively, showed that 59, 75, and 87% had ≥20X coverage in every exome, that 98, 94, and 99% had ≥20X coverage in at least one exome, that 96, 98, and 99% had ≥1X coverage in every exome, and that 100, 99.9, and 99.9% had ≥1X coverage in at least one exome (Supp. Fig. S4A). Similar results were observed when the analysis was limited to targeted bases within the CCDS-annotated exons.
Table 4. Analysis of 125 Exomes for Read Depth at Positions of Mutations within the UCSC-Annotated Exons of MD and Hereditary SPG Genes Reported in the Human Gene Mutation Database
MD gene dataset
SPG gene dataset
Total exonic positions corresponding to HGMD mutations*
*Every mutation registered in the HGMD database located within the exonic regions of the selected genes (all allelic phenotypes). Each nucleotide base position (nt) is considered a single base query, that is, for mutations spanning >1 nt or for nt involved in more than one mutation, each nt is considered a single “independent mutation” to avoid redundancy. Abbreviations: %, percentage; MD, muscle disease; No., number; SPG, spastic paraplegia; w/o, without; X, fold coverage. a) Within the exons of the 64 MD genes, 3,570 bases are referenced in HGMD as a position in which a mutation has been identified. b) Within the exons of the 24 spastic paraplegia genes, 799 bases are referenced in HGMD as a position in which a mutation has been identified. c) 3,268 (91.54%) MD mutation positions are targeted by the E-38 kit, d) 3,542 (99.22%) by the E-50 kit, and e) 3,543 (99.24%) by the E-62 kit. f) 785 (98.25%) SPG mutation positions are targeted by the E-38 kit, g) 786 (98.37%) by the E-50 kit, and h) 785 (98.25%) by the E-62 kit. i) Range (minimum to maximum) of the median coverage per exome in each dataset for the targeted HGMD positions. Range (minimum to maximum) of HGMD positions in each exome with j) 0X and k) <20X coverage; average percentage in parentheses. l) Analysis per HGMD position refers to the analysis of each individual HGMD nucleotide base position across every exome; that is, for the E-38 MD gene dataset, 2% of HGMD positions were identified as consistently having <20X coverage in all 78 samples (m), 59% consistently had ≥20X coverage in all 78 samples (n), and 39% of bases had variable coverage (<20X in some exomes and ≥20X in others).
o) For the E-38 exomes, none of the HGMD positions had null coverage in every exome tested; the E-50 exomes had consistently null coverage in 2 nucleotides of the MD mutation positions (p) and none in the SPG mutation positions; the E-62 exomes had q) 5 mutation bases in the MD dataset that had consistently null coverage across all 11 exomes (corresponding to 5 mutation positions within the same RYR1 exon), and r) 4 SPG mutation positions with null coverage across all 11 exomes (corresponding to 4 mutation positions within the same GJC2 exon).
Exon enrichment kit (# of samples)
No. of mutation positions targeted per kit
Range of median coverage per exome (i)
HGMD nucleotide position analysis across samples
Range of HGMD positions w/o sequence (average %) (j)
Range of HGMD positions with <20X (average %) (k)
Average % of HGMD positions with ≥20X
Analysis per HGMD mutation position (l)
HGMD positions with <20X in all exomes (m)
HGMD positions with ≥20X in all exomes (n)
HGMD positions with ≥20X in at least 1 exome
HGMD positions w/o sequence in all exomes (o)
HGMD positions with ≥1X in all exomes
Of the 799 mutation positions reported within the UCSC-annotated exons of the SPG genes, 785 (98%), 786 (98%), and 785 (98%) were targeted by the E-38, E-50, and E-62 kits, respectively (Table 4). As observed for the MD mutation positions, analysis of the targeted mutation positions in the E-38, E-50, and E-62 datasets, respectively, showed that 55, 82, and 86% had ≥20X coverage in all exomes, that 99, 97, and 96% had ≥20X coverage in at least one exome, that 95, 98, and 97% had ≥1X coverage in all exomes, and that 100, 100, and 99.5% had ≥1X coverage in at least one exome (Supp. Fig. S4B). Again, similar results were observed when the analysis was limited to targeted bases within the CCDS-annotated exons.
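The four per-position categories reported above (≥20X in every exome, ≥20X in at least one exome, ≥1X in every exome, ≥1X in at least one exome) can be illustrated with a minimal sketch. It assumes that read depths have already been extracted for each HGMD position in each exome; the function name, the input layout, and the example depths are hypothetical and are not taken from the study's pipeline or data.

```python
from typing import Dict, List


def summarize_position_coverage(
    depths: Dict[str, List[int]], threshold: int = 20
) -> Dict[str, float]:
    """Percentage of positions in each coverage category.

    `depths` maps a position label to its read depth in each exome
    (hypothetical layout; one list entry per exome).
    """
    n_positions = len(depths)
    if n_positions == 0:
        raise ValueError("no positions supplied")

    counts = {"ge_thr_all": 0, "ge_thr_any": 0, "ge_1x_all": 0, "ge_1x_any": 0}
    for per_exome in depths.values():
        # A position counts toward "all" only if every exome meets the cutoff,
        # and toward "any" if at least one exome does.
        counts["ge_thr_all"] += all(d >= threshold for d in per_exome)
        counts["ge_thr_any"] += any(d >= threshold for d in per_exome)
        counts["ge_1x_all"] += all(d >= 1 for d in per_exome)
        counts["ge_1x_any"] += any(d >= 1 for d in per_exome)

    return {key: 100.0 * value / n_positions for key, value in counts.items()}


# Toy example: four positions across three exomes (made-up depths).
example = {
    "pos_A": [25, 30, 40],  # >=20X in every exome
    "pos_B": [5, 22, 0],    # >=20X in one exome, no reads in another
    "pos_C": [0, 0, 0],     # null coverage in every exome
    "pos_D": [10, 15, 3],   # covered everywhere, but always below 20X
}
summary = summarize_position_coverage(example)
```

Positions such as `pos_C`, with null coverage in every exome, correspond to the category that additional sequencing by the same method is unlikely to rescue.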
Sanger Sequencing Effectively Fills Gaps in Exome Sequence Data
Given that a small number of the nucleotides of interest in CCDS-annotated regions had incomplete coverage by ES, we hypothesized that Sanger sequencing could be used as a complementary method, particularly for regions of high GC content or regions predisposed to misalignment because of paralogous sequences. To test the former, we chose the CCDS-annotated exon of FKRP, which has a GC content of 70.5%, and performed Sanger sequencing for 12 DNA samples that had previously undergone ES. For all 12 DNA samples, Sanger sequencing provided full and unambiguous sequence of the protein coding nucleotides of FKRP. To test the latter, we chose the last CCDS-annotated exon of TPM3, which has no sequence coverage in any dataset and 90.4–98.8% similarity to 12 other regions in the human reference genome (Supp. Table S19). Again, Sanger sequencing for the same 12 DNA samples provided full and unambiguous sequence of the protein coding nucleotides of the last CCDS-annotated exon of TPM3.
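GC content, as quoted above for the FKRP exon, is the fraction of G and C bases among the unambiguous bases of a sequence. The sketch below shows one way to compute it; the function name and the handling of ambiguous bases are assumptions, and the example uses a toy sequence, not FKRP.

```python
def gc_fraction(sequence: str) -> float:
    """Fraction of G/C among the A, C, G, and T bases of `sequence`.

    Ambiguous symbols such as N are ignored (an assumption of this sketch).
    """
    bases = [b for b in sequence.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no A, C, G, or T bases")
    return sum(b in "GC" for b in bases) / len(bases)


# Toy sequence: 4 of 6 unambiguous bases are G or C.
toy_gc = gc_fraction("ggccNNat")
```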
We have shown that ES provided a rapid and minimally invasive method to diagnose LGMD2A in sisters who were not diagnosed by immunohistological analysis of a muscle biopsy or other clinical tests. In addition, we show that, with the caveats discussed below, ES is a generalizable diagnostic method for genetically heterogeneous neuromuscular disorders and, by extension, other genetically heterogeneous diseases.
ES enables screening for mutations in a large number of genes concurrently, is noninvasive, and is cost effective compared to other clinical testing options. For the UCSC-annotated exons of the 88 MD and hereditary SPG genes targeted by the E-38, E-50, and E-62 kits, 77–95% of nucleotides per exome had adequate sequence coverage to identify a mutation (≥20X), and this was accomplished at a fraction of the cost of Sanger sequencing or pathological assays. Despite this advantage, certain caveats must be addressed as ES is implemented for clinical diagnostics. These include development of standard clinical reports defining precisely what has been tested in each individual and recommendations for follow-up using Sanger sequencing for nucleotides not tested by ES. The present analysis indicates that current ES does not test all disease genes or all exons in disease genes and that, even for those it tests, it does not reproducibly provide adequate sequence coverage for every single nucleotide. This occurs because the available ES kits do not capture all the nucleotides of clinical interest and because of limitations inherent in the processes of exome enrichment, library construction, and amplification. Consequently, as previously observed with a smaller sample set [Hedges et al., 2011], ES alone does not currently provide sufficient coverage for comprehensive variant calling in a clinical diagnostic setting.
The incomplete coverage we have observed has been partially addressed by the improvements in sequencing technology during the time of this study. Expansion of the capture kits has increased the number of genes enriched for in the libraries prepared for ES; however, it has not yet fully addressed the issues of poor coverage within targeted regions. Given that most diagnostic laboratories are not equipped to optimize or improve next generation sequencing technology, there are two options for addressing these regions of poor coverage: more next generation sequencing or parallel Sanger sequencing. As suggested by the high percentage of bases with at least 1X coverage or with 20X coverage in at least one exome (Tables 1–3), some nucleotides with low coverage could be improved by increasing the overall depth of sequencing. In contrast, null coverage suggests that the locus cannot be captured or sequenced or that the sequence data is too poor to align; therefore, additional next generation sequencing using the same method is unlikely to improve the coverage. By comparison, Sanger sequencing does improve coverage for at least some regions.
Complementation of massively parallel sequencing with Sanger sequencing of predictable regions of low coverage, such as exon boundaries that have been identified a priori, can be initiated at the project setup. For exonic nucleotides that have low coverage in regions not predictable beforehand, Sanger sequencing of these regions would be performed following analysis of the next generation sequence data (Fig. 5). This combination of next generation and Sanger sequencing provides a tiered approach to molecular diagnostics. In this model, ES is a first-tier screening test that can be supplemented with second-tier Sanger sequencing, although the amount of necessary Sanger sequencing will likely rapidly decline with continued refinement of ES and genome sequencing technology. Regardless, taking this approach requires that clinical reports clearly define what ES has adequately assessed.
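As an illustration of the second tier, the nucleotides that fall below the coverage cutoff in a given exome can be collapsed into contiguous intervals, which could then be listed in the clinical report and used to plan Sanger follow-up. This is a minimal sketch under stated assumptions: per-base depths for one exon are already available as a list, coordinates are 1-based and inclusive, and the function name and example values are hypothetical.

```python
from typing import List, Tuple


def low_coverage_intervals(
    depths: List[int], start: int = 1, threshold: int = 20
) -> List[Tuple[int, int]]:
    """Return inclusive (first, last) intervals where depth < threshold.

    `depths[i]` is the read depth at coordinate `start + i`.
    """
    intervals: List[Tuple[int, int]] = []
    run_start = None  # first coordinate of the current low-coverage run
    for offset, depth in enumerate(depths):
        position = start + offset
        if depth < threshold:
            if run_start is None:
                run_start = position
        elif run_start is not None:
            # The run ended at the previous base.
            intervals.append((run_start, position - 1))
            run_start = None
    if run_start is not None:
        # The run extends to the last base supplied.
        intervals.append((run_start, start + len(depths) - 1))
    return intervals


# Made-up depths for eight consecutive bases starting at coordinate 101.
follow_up = low_coverage_intervals([30, 25, 12, 0, 0, 22, 40, 19], start=101)
# follow_up == [(103, 105), (108, 108)]
```

Reporting intervals rather than individual bases keeps the list of inadequately tested nucleotides compact; the 20X default mirrors the adequacy cutoff used in this analysis and can be changed through the `threshold` parameter.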
An alternative interim approach pursued currently by some laboratories is exon enrichment for subsets of genes. This decreases the complexity of the library and thus increases the depth of coverage for those particular genes and possibly the number of nucleotides consistently sequenced; however, the approach is limited to currently known disease-associated genes and thus requires continued generation of new capture kits as well as retesting of patients who previously tested negative. In contrast, ES and genome sequencing do not have these limitations because they are applicable to all genetically heterogeneous diseases and are inclusive of genes not yet disease associated. Therefore, the generic approach of ES and genome sequencing makes them much more appealing.
In summary, ES provides a noninvasive first-tier method to screen rapidly for mutations associated with genetically heterogeneous disorders. Application of ES as the sole diagnostic test is limited by failure to target some exons and by low sequence coverage of some nucleotides. Nonetheless, precise reporting of inadequately tested nucleotides and supplemental Sanger sequencing provide a sensitive and cost-effective molecular diagnostic approach.
We thank Dr. Camilo Toro, Dr. Jan Friedman, and Dr. Alireza Baradaran-Heravi for critical review of the manuscript.