Keywords:

  • Exome sequencing;
  • inherited disease;
  • false positives;
  • next generation sequencing;
  • genomics;
  • Illumina;
  • sequencing errors;
  • alignment errors;
  • WES;
  • SureSelect Human All Exon

Abstract

  1. Top of page
  2. Abstract
  3. Introduction
  4. Methods
  5. Results
  6. Discussion
  7. Acknowledgements
  8. References
  9. Supporting Information

Disease gene discovery has been transformed by affordable sequencing of exomes and genomes. Identification of disease-causing mutations requires sifting through a large number of sequence variants. A subset of the variants are unlikely to be good candidates for disease causation based on one or more of the following criteria: (1) being located in genomic regions known to be highly polymorphic, (2) having characteristics suggesting assembly misalignment, and/or (3) being labeled as variants based on misleading reference genome information. We analyzed exome sequence data from 118 individuals in 29 families seen in the NIH Undiagnosed Diseases Program (UDP) to create lists of variants and genes with these characteristics. Specifically, we identified several groups of genes that are candidates for provisional exclusion during exome analysis: 23,389 positions with excess heterozygosity suggestive of alignment errors and 1,009 positions in which the hg18 human genome reference sequence appeared to contain a minor allele. Exclusion of such variants, which we provide in supplemental lists, will likely enhance identification of disease-causing mutations using exome sequence data. Hum Mutat 33:609–613, 2012. © 2012 Wiley Periodicals, Inc.


Introduction

Identification of disease-causing genes among the variants generated by exome sequencing (ES) requires the separation of candidates with high pathogenic potential from variants that have a low probability for disease causation. Numerous well-described mechanisms can generate low-interest variants. Biological sources of low-interest variants include both common and rare population variation. Certain regions of the genome are unusually variable and the study of ES data from even a few individuals reveals genes that vary from the reference sequence in most, if not all, sequenced individuals.

High-throughput sequencing techniques also generate low-interest variants in the form of genotype false positives. Errors can arise from biases in the library construction [Aird et al., 2011; Bentley et al., 2008; Koboldt et al., 2010; Teer et al., 2010], errant polymerase reactions [Aird et al., 2011], difficulty making genotype calls at the end of short reads, loss of synchrony among DNA sequencing reactions within a cluster [Ledergerber and Dessimoz, 2011], or manufacturer/platform-specific mechanistic problems such as overlap in absorption spectra for guanine and thymine in the Illumina system [Dohm et al., 2008; Meacham et al., 2011]. Misalignment of sequencing reads to a reference sequence (RefSeq) and inaccuracies or biases of the RefSeq compared to a specific local population are other sources of false-positive genotype calls in next generation sequencing (NGS) data [Church et al., 2011]. Misalignment of short sequencing reads to a reference sequence is influenced by the choice of seed-based strategies or algorithms for complete alignment permutations [Homer and Nelson, 2010; Li and Durbin, 2009; Schatz et al., 2010]. These problems often arise in regions with low complexity [Landan and Graur, 2007] or result from misalignment of multiple copies of genes, paralogues, or pseudogenes [Blankenberg et al., 2010].

The reference sequence itself may be an additional source of variants. For some base positions, the reference sequence specifies an allele that is the minor allele in most large human populations. Such biases occur because the original reference sequence was based on a limited number of individuals and because of sequencing and alignment errors [Lander et al., 2001]. As a result, the NCBI human genome reference sequence includes minor variants, unique variants, and, possibly, disease-causing mutations [Snyder et al., 2010] (http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/).

Some of these variants can be identified and provisionally excluded during a search for disease-causing variants. Herein, we provide example exclusion lists based upon our accumulated hg18 ES data. In addition, it is important for researchers to generate similar exclusion lists from their own datasets to take into account errors that may be specific to the sequencing and analysis methodology or the human genome reference version used.

Methods

Patients

Patients accepted into the NIH Undiagnosed Diseases Program (UDP) were enrolled in clinical protocol 76-HG-0238, approved by the Institutional Review Board of the National Human Genome Research Institute, and gave written informed consent. The patients were members of 29 different families and had unique and widely divergent phenotypes. These 29 families contained 55 founders and additional affected and unaffected siblings of the probands, for a total of 118 individuals. Of the 29 families, 5 have been diagnosed [see accompanying manuscripts; Gahl et al., 2011; Gahl and Tifft, 2011; Pierson et al., 2011] and several others have strong candidate gene leads identified by ES.

An additional anonymized dataset of 401 exome sequences derived from the ClinSeq™ study [Biesecker et al., 2009] was used as a cross-check on the population characteristics of specific variants discovered in the UDP data set.

DNA Extraction

DNA was extracted from 10 ml of peripheral whole blood using the Puregene kit (Qiagen, Inc., Valencia, CA) according to the manufacturer's protocol.

Exome Sequencing, Sequence Alignment, and Variant Annotation

Initially, the Agilent (Santa Clara, CA) Human 38 Mb all exome capture method was used for enrichment and, as the design improved to capture additional exons and previously unannotated genes, the 50 Mb capture method was substituted [Coffey et al., 2011; Gnirke et al., 2009]. One hundred and fifty base pair (bp) insert libraries were used for capture without indexing. The Illumina GAIIx platform was used to obtain paired-end 76 bp and 101 bp sequencing reads [Bentley et al., 2008]. Potential duplicate reads arising from polymerase chain reaction (PCR) duplication were retained because the National Institutes of Health Intramural Sequencing Center (NISC) has observed that their PCR duplicate levels are consistently <10% of reads and that their genotypes have >99.9% concordance with genotypes from single nucleotide polymorphism (SNP) arrays.

Alignment to the human genome reference sequence (UCSC assembly hg18, NCBI build 36) was carried out using the efficient large-scale alignment of nucleotide databases (ELAND) package (Illumina, Inc., San Diego, CA). ELAND was used in such a way that paired-end reads were aligned independently, and those that aligned uniquely were grouped into genomic sequence intervals of about 100 kb. Those that failed to align were binned with their paired-end mates, thus making use of paired-end information not utilized by ELAND. Reads that mapped equally well in more than one location were discarded. Cross_Match (P. Green, http://www.phrap.org), a Smith–Waterman-based local alignment algorithm, was used to align binned reads to their respective 100 kb genomic sequence, using the parameters −minscore 21 and −masklevel 0. Cross_Match alignments were converted to the SamTools BAM format. Because of the large number of exome sequences already aligned to hg18, NISC has continued to use this as the reference for exome sequence alignment. To compare our exome sequences to the ClinSeq exome sequences, which we use as an internally controlled allele frequency filter, we did not realign to the hg19 reference. We also elected to align to hg18 because even though hg18 has technically been superseded by hg19, hg18 still has more UCSC annotation. Consequently, all positions within the main text and supporting information refer to hg18 genome coordinates.

Genotypes were called using bam2mpg2 (http://research.nhgri.nih.gov/software/bam2mpg) for all positions with high-quality sequence (Phred-like Q20 or greater) using a Bayesian algorithm (Most Probable Genotype [MPG]) [Teer et al., 2010]. Genotypes with an MPG score ≥10 had a >99.89% concordance to genotypes from SNP array data. Similar to the method for false-positive reduction in the GATK software [DePristo et al., 2011], an optional data quality filter of MPG/coverage ratio ≥0.5 was also applied to reduce false positives due to alignment errors [Ajay et al., 2011; Wei et al., 2011]. Missense variants were then assigned a delta score depending on the predicted degree of severity for functional disruption using the Conserved Domain-based Prediction (CDPred) algorithm [Bell et al., 2011; Johnston et al., 2010; McLaughlin et al., 2010; Prickett et al., 2011] (http://research.nhgri.nih.gov/software/CDPred/). Variants with a CDPred delta score between -1 and -30 are classified as “predicted deleterious”. CDPred scores are based on well-annotated and manually curated protein domains when the variant can be aligned to an entry in the NCBI Conserved Domain Database (CDD). When alignment to a CDD entry is not possible, CDPred defaults to the BLOSUM 62 substitution matrix. The positive or negative magnitude of output scores is more restricted when the substitution matrix is used, reflecting the paucity of data in those regions. CDPred was chosen over other programs such as SIFT [Ng and Henikoff, 2003] and Polyphen [Adzhubei et al., 2010] because it was easy to incorporate into an automated pipeline, provided suitable output characteristics for our analyses, and performed similarly to SIFT and Polyphen (unpublished data).
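The two quality criteria described above (MPG score ≥10 and MPG/coverage ratio ≥0.5) can be illustrated with a minimal Python sketch. This is not the bam2mpg implementation; the GenotypeCall record, its field names, and the toy values are assumptions made only for illustration.

```python
from dataclasses import dataclass


@dataclass
class GenotypeCall:
    """Hypothetical record for one genotype call (field names are assumed)."""
    position: str      # e.g. "chr1:1000" in hg18 coordinates
    mpg_score: float   # Most Probable Genotype score
    coverage: int      # number of reads covering the position


def passes_quality(call, min_mpg=10.0, min_ratio=0.5):
    """Return True if the call meets both thresholds quoted in the text."""
    if call.coverage <= 0:
        return False  # no reads, so the ratio is undefined
    ratio = call.mpg_score / call.coverage
    return call.mpg_score >= min_mpg and ratio >= min_ratio


# Toy calls: only the first satisfies both criteria.
calls = [
    GenotypeCall("chr1:1000", 40.0, 50),  # MPG 40, ratio 0.8
    GenotypeCall("chr1:2000", 12.0, 60),  # MPG 12, ratio 0.2 (ratio too low)
    GenotypeCall("chr1:3000", 8.0, 10),   # MPG 8 (score too low)
]
kept = [c.position for c in calls if passes_quality(c)]
```

The ratio filter penalizes positions where deep coverage yields only a modest genotype score, a pattern that may indicate mixed reads from misaligned sequences.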

Filtering and Statistical Analysis of Variants

The variant lists provided by NISC were sorted and filtered using the VarSifter software (http://research.nhgri.nih.gov/software/VarSifter) [Teer et al., 2012] and then exported to Excel (Microsoft Corp., Renton, WA) for further analysis. Boolean logic filtering was performed using the JavaSDK package implemented in VarSifter. Conditional exact Hardy–Weinberg equilibrium (HWE) one-sided testing was performed on all the available data for the UDP variants using the conditional HWExact module with the “greater” option selected. The R language package used for this test was developed by Jan Graffelman [Engels, 2009; Graffelman, 2010; Wigginton et al., 2005] (http://www.r-project.org, http://www-eio.upc.edu/~jan).
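For readers who want to see what a conditional exact one-sided HWE test computes, the following Python sketch sums the exact probabilities of all heterozygote counts at least as large as the observed count, conditional on the observed allele counts. It is an illustrative reimplementation of the general approach, not the R module used in this study; the function names are hypothetical, and the "greater" alternative is assumed to correspond to heterozygote excess.

```python
from math import exp, lgamma, log


def _log_factorial(k):
    return lgamma(k + 1)


def het_probability(n_ab, n, n_a):
    """Exact probability of n_ab heterozygotes among n individuals,
    conditional on n_a copies of allele A (and 2n - n_a copies of B)."""
    n_b = 2 * n - n_a
    n_aa = (n_a - n_ab) // 2
    n_bb = (n_b - n_ab) // 2
    log_p = (
        n_ab * log(2)
        + _log_factorial(n)
        - _log_factorial(n_aa) - _log_factorial(n_ab) - _log_factorial(n_bb)
        + _log_factorial(n_a) + _log_factorial(n_b)
        - _log_factorial(2 * n)
    )
    return exp(log_p)


def hwe_excess_het_pvalue(n_aa, n_ab, n_bb):
    """One-sided exact p value for excess heterozygosity."""
    n = n_aa + n_ab + n_bb
    n_a = 2 * n_aa + n_ab
    n_b = 2 * n_bb + n_ab
    max_het = min(n_a, n_b)
    # Valid heterozygote counts share the parity of n_a, hence the step of 2.
    return sum(het_probability(k, n, n_a) for k in range(n_ab, max_het + 1, 2))


# A position that is heterozygous in all 55 founders.
p_all_het = hwe_excess_het_pvalue(0, 55, 0)
# A position with genotype counts close to HWE expectations.
p_balanced = hwe_excess_het_pvalue(14, 27, 14)
```

In this sketch, a position heterozygous in every founder gives a p value many orders of magnitude smaller than one with balanced genotype counts, which is the pattern the test is meant to flag.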

The VarSifter-formatted BED files mentioned in this manuscript (Supp. Files S1–S4) were generated using the online software analysis suite Galaxy (http://main.g2.bx.psu.edu/root) [Blankenberg et al., 2010; Goecks et al., 2010]. These BED files are provided as supporting information in the online version of this manuscript. The primary purpose of these files is to exclude suspected false-positive variants from the data being queried within VarSifter, but VarSifter offers only an “include bedfile positions” option. The genome-wide complement data option in Galaxy was therefore used so that each file lists every region except the excluded positions. As a result, although these files can be viewed using the UCSC genome browser (http://genome.ucsc.edu/), the display of these custom tracks shows variant positions in white and all other regions of the genome as a solid black line.
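The complement operation described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Galaxy tool itself; the function name, the toy chromosome length, and the example intervals are assumptions. Intervals follow the BED convention of 0-based, half-open coordinates.

```python
def complement(intervals, chrom_sizes):
    """Return the regions of each chromosome NOT covered by `intervals`.

    intervals:   iterable of (chrom, start, end) tuples, BED-style
    chrom_sizes: dict mapping chromosome name to its length
    """
    by_chrom = {}
    for chrom, start, end in intervals:
        by_chrom.setdefault(chrom, []).append((start, end))

    result = []
    for chrom, size in chrom_sizes.items():
        cursor = 0  # first base not yet accounted for
        for start, end in sorted(by_chrom.get(chrom, [])):
            if start > cursor:
                result.append((chrom, cursor, start))
            cursor = max(cursor, end)  # also handles overlapping intervals
        if cursor < size:
            result.append((chrom, cursor, size))
    return result


# Toy example: two suspected false-positive positions on a 1,000 bp chromosome.
excluded = [("chr1", 100, 101), ("chr1", 500, 502)]
included = complement(excluded, {"chr1": 1000})
```

Supplying the resulting intervals to an include-only filter keeps everything except the suspected false-positive positions.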

Comparison of Identified Variants with Other Platforms and Alignment Methods

The subset of putative confounding variants identified in the UDP exomes and meeting the criteria of MPG ≥ 10 and MPG/Coverage ≥ 0.5 was compared to four whole genome sequencing datasets generated with the Illumina HiSeq 2000 sequencer and paired-end reads (100 bp). The sequence data were aligned using the Burrows–Wheeler Aligner (BWA) [Li and Durbin, 2009]. We also compared ES variants to 69 human genomes publicly released by Complete Genomics, Inc. [Drmanac et al., 2010] using BEDtools (v2.12) and in-house Perl scripts.

Gene Exclusion List Generation

Developing the provisional gene name exclusion list began by grouping all variants for the 29 families by locus name. The total number of predicted deleterious variants per family at each locus was recorded and the loci were sorted by number of occurrences. Validated NCBI pseudogenes were identified in the latest version of Gene (http://www.ncbi.nlm.nih.gov/gene) and added to an alternative gene exclusion list.
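The grouping step can be sketched as follows in Python. The tuple layout, the gene and family labels, and the default thresholds (at least 10 predicted deleterious variants, CDPred ≤ −1, in at least 3 families) are assumptions chosen to mirror the criteria reported in this manuscript; this is not the procedure actually run in VarSifter and Excel.

```python
from collections import Counter, defaultdict


def polymorphic_genes(variants, min_variants=10, min_families=3, cdpred_cutoff=-1):
    """Return genes with >= min_variants predicted deleterious variants
    in >= min_families families.

    variants: iterable of (family_id, gene, cdpred_score) tuples
    """
    per_gene_family = Counter()
    for family, gene, cdpred in variants:
        if cdpred <= cdpred_cutoff:  # predicted deleterious
            per_gene_family[(gene, family)] += 1

    families_over_threshold = defaultdict(set)
    for (gene, family), count in per_gene_family.items():
        if count >= min_variants:
            families_over_threshold[gene].add(family)

    return sorted(
        gene
        for gene, families in families_over_threshold.items()
        if len(families) >= min_families
    )


# Toy data with hypothetical gene and family labels.
toy = []
for fam in ("F1", "F2", "F3"):
    toy += [(fam, "GENE_A", -5)] * 10  # deleterious in three families
    toy += [(fam, "GENE_C", 2)] * 12   # many variants, none deleterious
toy += [("F1", "GENE_B", -3)] * 10     # deleterious in one family only

candidates = polymorphic_genes(toy)
```

Only the gene that is repeatedly variant across several families is returned, which matches the intent of treating such genes as provisional exclusions rather than permanent ones.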

Results

ES Variants Shared among Probands with Dissimilar Phenotypes

Patients enrolled in the UDP exhibit heterogeneous, striking, and unusual phenotypes that have eluded diagnosis and may reflect new diseases. To determine the genetic bases of these disorders, we performed ES on a subset of participants. Using the 38 Mb Agilent Human SureSelect Kit, which targets the NCBI Consensus Coding Sequence, and two GAIIx flow cell lanes, we obtained an average of 2.8 Gb of aligned sequence bases per sample, and on average 88.9% of baited nucleotides had an MPG score ≥ 10. For the 50 Mb Agilent Human SureSelect Kit, we obtained an average of 3.9 Gb of aligned sequence bases per sample, and on average 88.8% of baited nucleotides had an MPG score ≥ 10. Further details on coverage statistics are provided in the accompanying manuscript [Dias et al., 2012].

The total set of sequenced exomes comprised 118 individuals, including a founder subset of 55 individuals. Using a quality cutoff of MPG ≥ 10, a subset of 698,248 unique variants was detected in the total set, and a subset of 549,242 in the founder set (Table 1). Many variants were recurrent, despite highly divergent phenotypes among the probands. The diseases represented by the UDP cohort are likely rare and highly penetrant. We reasoned that any ES variant shared by multiple families with different proband phenotypes is unlikely to be disease causing even if it is predicted to be deleterious by algorithms such as SIFT [Ng and Henikoff, 2003], Polyphen [Adzhubei et al., 2010] or CDPred [Johnston et al., 2010]. This provided a component of the rationale for creating lists of low-interest variants.

Table 1. Analysis of 698,248 Sequence Variants Detected in ES Data Obtained from 29 UDP Families

Data set                                                              Number of variants   Number of genes
Variants arising in highly polymorphic genes
  ≥10 variants in all families                                        N/A                  17
  ≥10 variants in ≥3 families                                         N/A                  166
Variants arising from misalignment
  Heterozygous in every exome                                         392                  45
  Excess heterozygosity                                               23,389               2,576
Variants arising from biases in the Hg18 human genome reference sequence
  Homozygous non-Hg18 human genome reference sequence in every exome  1,009                707

Highly Variable Genes

We hypothesized that genes frequently containing numerous pathogenic variants had a low probability of being the source of disease-causing candidates for most exome/genome projects. Therefore, we sought to exclude genes that had frequent variations from the RefSeq that were plausibly pathogenic (missense, nonsense, frame shifting, canonical-splice-site modifying) and rare enough to remain after filtering out common polymorphisms. For some exclusion lists, we applied a software prediction of variant pathogenicity using CDPred. The genes we identified are enumerated in the lists Supp. Tables S1-S7 and are listed along with construction notes in Supp. Table S8. For our exome projects, we applied gene exclusion lists as a provisional filtration step, adding back subsets of the excluded genes if no convincing disease-causing variants were found.

Deviations from Hardy-Weinberg Equilibrium: Excess Heterozygosity

The presence of excess heterozygosity in a cohort of exome sequence data is suggestive of sequence-read alignment errors, wherein two similar sequences, each homozygous for a different nucleotide at one or more positions, are aligned to the same location. We investigated whether such patterns existed in our data and found 392 variants with an MPG ≥10 that were heterozygous in all 118 UDP exomes (Supp. File S1 and Supp. Table S9, tab heterozygous_nonref_annotations.xls).

Previous publications concerning SNP [Doron and Shweiki, 2011] and exome data have proposed that the genotypes of misaligned sequences will be in Hardy–Weinberg disequilibrium [Engels, 2009; Graffelman, 2010; Wigginton et al., 2005]. The a priori probability of only heterozygous genotypes, based upon equal allele frequencies, is p ≤ (1/2)^55 for the 55 independent founder genotypes and is p ≤ (1/2)^118 when including the entire cohort. Applying a Bonferroni correction of 7.0 × 10^−5 to the 549,242 ES variants identified in the founders, we concluded that a conditional, single-tailed, HWE exact p value of less than 7.0 × 10^−8 would be significant for inclusion into a false-positive list at p < 0.05. Using this criterion, we identified 23,389 positions with excess heterozygosity (Supp. File S2 and Supp. Table S9, tab heterozygous_nonref_annotations_2.xls); each variant had an MPG ≥10 in at least one exome.
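The arithmetic behind these bounds can be checked with a short Python sketch. The helper name is hypothetical, and the choice of 698,248 tests (the total variant count) is an assumption made for illustration; the exact inputs used to derive the published cutoff are not restated here.

```python
from fractions import Fraction

# With equal allele frequencies, each individual is heterozygous with
# probability 1/2 under HWE, so n independent individuals are all
# heterozygous with probability (1/2)^n.
p_founders = float(Fraction(1, 2) ** 55)   # 55 independent founders
p_cohort = float(Fraction(1, 2) ** 118)    # upper bound quoted for all 118 exomes


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold for a family-wise error rate alpha."""
    return alpha / n_tests


# Assumption: one test per variant in the total set of 698,248 variants.
threshold_total = bonferroni_threshold(0.05, 698_248)
```

Under that assumption, the per-test threshold comes out at roughly 7.2 × 10^−8, of the same order as the HWE exact p value cutoff quoted above, and the all-heterozygous probability for the founders falls far below it.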

We reasoned that, if these heterozygous variants arose from a compression block, a region where highly similar sequences are inadvertently compressed computationally [Roach et al., 2010], the two nearly identical component sequences that were misaligned might show up as copy number variations detected by other means. Confirming our suspicion, we found 15,140 variants in CNV regions listed in the Database of Genomic Variants (DGV) and, as identified by RepeatMasker, 2,104 variants within repeat regions and 593 variants within tandem repeats using SeattleSeq variant annotation (http://gvs.gs.washington.edu/SeattleSeqAnnotation/).

Comparing the 392 positions where heterozygosity was the genotype in every exome to the Agilent SureSelect baited regions, we found that fully half of the heterozygous variation in these positions arose from incidental capture of nontargeted regions. In addition, two baits had targeted regions that are now annotated as pseudogenes.

Deviations from Hardy–Weinberg Equilibrium: Excess Homozygosity

The presence of excess nonreference homozygosity for a given base pair in an exome cohort suggests that the reference sequence contains a minor-allele nucleotide designation, one that does not represent the major allele in the population from which the exome cohort was derived. We identified 1,009 positions in which every exome was homozygous for a nonreference genotype with an MPG ≥ 10. Using the UCSC genome browser to compare a subset of 187 randomly selected variants to cDNA sequences aligned to the hg18 human genome reference sequence, we found that in all cases where a cDNA sequence was available, the reference cDNA sequence agreed with the nonreference genotype call in our exome data. The 1,009 nonreference homozygous variants are provided as a VarSifter BED-formatted file (Supp. File S3), and additional data about the variants are included in Supp. Table S9 (tab homozygous_nonref_annotations.xls).
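The two cohort-wide patterns used in this study (heterozygous in every exome, and homozygous nonreference in every exome) can be expressed as a simple per-position classification. The Python sketch below assumes genotypes are encoded as two-letter strings such as "AG"; that encoding and the function name are illustrative assumptions rather than the format produced by our pipeline.

```python
def classify_position(ref_allele, genotypes):
    """Classify one position from the genotypes of all exomes in a cohort.

    ref_allele: reference base at this position, e.g. "A"
    genotypes:  list of two-letter diploid genotypes, e.g. ["AG", "AG"]
    """
    if not genotypes:
        return "other"  # no calls, nothing to conclude
    if all(a != b for a, b in genotypes):
        return "all_heterozygous"        # candidate misalignment
    if all(a == b and a != ref_allele for a, b in genotypes):
        return "all_homozygous_nonref"   # candidate reference minor allele
    return "other"


# Toy cohorts of three exomes each.
het_call = classify_position("A", ["AG", "AG", "GA"])
hom_call = classify_position("A", ["GG", "GG", "GG"])
mixed_call = classify_position("A", ["AA", "AG", "GG"])
```

In practice each genotype would also need to pass the quality criteria (for example, MPG ≥ 10) before being counted.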

Presence of Variants in dbSNP

The mechanisms discussed above may also produce DNA variant genotype calls with other types of genotyping technology. We searched dbSNPv130 to see if our variants had been previously reported. Of all the homozygous nonreference variants of high quality (MPG ≥ 10), 96.8% were in dbSNP. For the heterozygous variants identified by HWE testing, 68% were in dbSNP. The SNP reference numbers for all variants are provided in Supp. Table S9.

Discussion

To filter false-positive variants from exome sequence data and thereby aid discovery of disease-causing mutations, we present a set of three tools, that is, lists of highly polymorphic genes, positions of recurrent suspected misalignments to the human genome reference sequence, and positions at which the hg18 human genome reference sequence contains a minor allele. These lists, complementing previously published exome variant analysis tools [Bilguvar et al., 2010; Choi et al., 2009; Hoischen et al., 2010; Ng et al., 2010a; Ng et al., 2010b], were derived from analyses of exome data generated by the NIH UDP exome set, not from analysis of public databases. The tools are likely to be platform or alignment specific to some extent, particularly the list of heterozygous sites. However, they may be applicable, in general, to exome data collected with the widely used Agilent–Illumina sequencing technologies.

Our list of highly “polymorphic” genes starts with those identified empirically as having a large number of variants. We arbitrarily defined “large” as at least 10 variants predicted to be deleterious by the CDPred algorithm (CDPred ≤ −1). One method of selecting genes that appear to be enriched for false-positive variants is to examine a diverse human population for genes that consistently generate many variants. Such enrichment could be due to sequence alignment artifacts or to the truly polymorphic nature of the genes; the ES samples obtained from the NIH UDP cohort constitute an ideal cohort of such individuals with disparate conditions [Gahl and Tifft, 2011]. To this list, we added genes of lower relevance for our UDP patients, such as members of the olfactory and taste receptor gene families. Although these genes could cause disease by duplication and divergence to a moonlighting function or by exerting a dominant negative effect, we considered these possibilities unlikely; also, adding them to our list of excluded genes on a first-pass analysis does not preclude the option of revisiting them later. After thorough characterization of genes and regions where such variants are found, future refinements could involve excluding only the highly polymorphic regions of these genes rather than the entire gene loci.

Another common problem in analyzing ES data involves confounding variants arising from misalignment of sequences to the human genome reference sequence. These false positives can be identified using family data. Any ES data set, if produced in a consistent way, can be analyzed for HWE deviations. Given the large number of variant positions generated by ES, a conservative Bonferroni correction for multiple sampling was used to avoid spurious exclusion of variants that by chance appeared out of HWE. Furthermore, lists of false-positive variants derived by looking at HWE should be generated using a single alignment algorithm, as was done in this case. However, even when looking across platforms and algorithms at the 37% (146/393) of the ES variants where there are only heterozygous genotypes, we found concordance in more than 50 of the 69 publicly available Complete Genomics genomes.

Another source of confounding variants arises when the hg18 human genome reference sequence contains a minor allele, rare disease-causing variant, or a simple sequencing error. Although some of our variants might occur due to systemic errors in NGS compared to the Sanger method used to generate the human genome reference sequence [Balasubramanian et al., 2011; van der Maarel et al., 2011], comparing NGS of exomes with that of Illumina genomes confirmed that the vast majority of the variant genotypes were correctly called. For 85% (863/1009) of the ES homozygous non-RefSeq genotypes, we also found concordance in more than 50 of the 69 publicly available Complete Genomics genomes. This suggests that these variants are not always platform, chemistry, or alignment specific. Fortunately, these errors will become less common as non-disease-causing variations in the human genome are identified and annotated [Church et al., 2011]. In fact, the accelerating accumulation of sequencing data continues to contribute to the accuracy of a variety of data sets including dbSNP and the human pan-genome [Li et al., 2010].

Many of the variants detected by HWE and exome-dataset analysis also occur in dbSNP. For the homozygous nonreference variants, this is not surprising, since minor alleles in the reference sequence will differ from sequence data obtained using any technology. For heterozygous variants, the percentage of variants in dbSNP is smaller, and may represent similar sequencing specificity issues as those that arise using NGS. Filtration using unselected dbSNP records introduces a well-described hazard of excluding important and possibly disease-causing variants. Identification of variants using the methods we describe allows for the construction of ES variant filters with known and rationally formulated characteristics.

In conclusion, incremental improvements in the analysis of genome data will occur with improved sequencing chemistry, better alignments, longer read lengths, deeper coverage, and advanced technologies that address inadequacies in long-range sequencing and gap filling [Homer and Nelson, 2010; Schatz et al., 2010]. For now, lists of problematic genes or variant locations (e.g., heterozygous genotypes with HWE inconsistencies or all homozygous non-human genome reference alleles) help to identify false-positive signals (Supp. File S4 and Supporting Information text). Such lists assist in the winnowing of ES variants and are essential for disease-causing gene discovery.

Acknowledgements

We thank our patients and their families, who are partners in the pursuits of the NIH UDP. We appreciate the excellent technical skills of Roxanne Fischer and Richard Hess. We also thank Dr. Ajay Shankar Subramaniam and Dr. Elliott H Margulies, who helped with the correlation of suspected false-positive variants in exome sequencing against Illumina genome sequences. We value the help of Taylor Davis, who assisted in the manual investigation of the “all homozygous” variants in our dataset using the UCSC browser, and Dr. Praveen Cherukuri's advice regarding the use of CDPred for our dataset.

References

  • Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, Bork P, Kondrashov AS, Sunyaev SR. 2010. A method and server for predicting damaging missense mutations. Nat Methods 7:248249.
  • Aird D, Ross MG, Chen WS, Danielsson M, Fennell T, Russ C, Jaffe DB, Nusbaum C, Gnirke A. 2011. Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biol 12(2):R18.
  • Ajay SS, Parker SC, Abaan HO, Fajardo KV, Margulies EH. 2011. Accurate and comprehensive sequencing of personal genomes. Genome Res 21:1498505.
  • Balasubramanian S, Habegger L, Frankish A, MacArthur DG, Harte R, Tyler-Smith C, Harrow J, Gerstein M. 2011. Gene inactivation and its implications for annotation in the era of personal genomics. Genes Dev 25:110.
  • Bell DW, Sikdar N, Lee KY, Price JC, Chatterjee R, Park HD, Fox J, Ishiai M, Rudd ML, Pollock LM, Fogoros SK, Mohamed H, and others. 2011. Predisposition to cancer caused by genetic and functional defects of mammalian Atad5. PLoS Genet 7:e1002245.
  • Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, Hall KP, Evers DJ, Barnes CL, Bignell HR, Boutell JM, Bryant J, and others. 2008. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456:5359.
  • Biesecker LG, Mullikin JC, Facio FM, Turner C, Cherukuri PF, Blakesley RW, Bouffard GG, Chines PS, Cruz P, Hansen NF, Teer JK, Maskeri B, and others. 2009. The ClinSeq Project: piloting large-scale genome sequencing for research in genomic medicine. Genome Res 19:16651674.
  • Bilguvar K, Ozturk AK, Louvi A, Kwan KY, Choi M, Tatli B, Yalnizoglu D, Tuysuz B, Caglayan AO, Gokben S, Kaymakcalan H, Barak T, and others. 2010. Whole-exome sequencing identifies recessive WDR62 mutations in severe brain malformations. Nature 467:207210.
  • Blankenberg D, VonKuster G, Coraor N, Ananda G, Lazarus R, Mangan M, Nekrutenko A, Taylor J. 2010. Galaxy: a web-based genome analysis tool for experimentalists. Curr Protoc Mol Biol. 89:19.10.1–19.10.21.
  • Choi M, Scholl UI, Ji W, Liu T, Tikhonova IR, Zumbo P, Nayir A, Bakkaloğlu A, Ozen S, Sanjad S, Nelson-Williams C, Farhi A, Mane S, Lifton RP. 2009. Genetic diagnosis by whole exome capture and massively parallel DNA sequencing. Proc Natl Acad Sci USA 106:1909619101.
  • Church DM, Schneider VA, Graves T, Auger K, Cunningham F, Bouk N, Chen HC, Agarwala R, McLaren WM, Ritchie GR, Albracht D, Kremitzki M, and others. 2011. Modernizing reference genome assemblies. PLoS Biol 9:e1001091.
  • Coffey AJ, Kokocinski F, Calafato MS, Scott CE, Palta P, Drury E, Joyce CJ, Leproust EM, Harrow J, Hunt S, Lehesjoki AE, Turner DJ, Hubbard TJ, Palotie A. 2011. The GENCODE exome: sequencing the complete human exome. Eur J Hum Genet 19:827831.
  • DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, Del Angel G, Rivas MA, Hanna M, McKenna A, Fennell TJ, and others. 2011. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 43:491498.
  • Dias C, Sincan M, Rupps R, Briemberg H, Selby K, Mullikin J, Markello T, Adams D, Gahl WA, Boerkoel CF. 2011. Exome sequencing: diagnosis of genetically heterogeneous neuromuscular disorders. Hum Mutat 33.
  • Dohm JC, Lottaz C, Borodina T, Himmelbauer H. 2008. Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Res 36:e105.
  • Doron S, Shweiki D. 2011. SNP uniqueness problem: a proof-of-principle in HapMap SNPs. Hum Mutat 32:355357.
  • Drmanac R, Sparks AB, Callow MJ, Halpern AL, Burns NL, Kermani BG, Carnevali P, Nazarenko I, Nilsen GB, Yeung G, Dahl F, Fernandez A, and others. 2010. Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays. Science 327:7881.
  • Engels WR. 2009. Exact tests for Hardy–Weinberg proportions. Genetics 183:143141.
  • Gahl WA, Markello TC, Toro C, Fajardo KF, Sincan M, Gill F, Carlson-Donohoe H, Gropman A, Pierson TM, Golas G, Wolfe L, Groden C, and others. 2012. The National Institutes of Health Undiagnosed Diseases Program: insights into rare diseases. Genet Med 14:5159.
  • Gahl WA, Tifft CJ. 2011. The NIH Undiagnosed Diseases Program: lessons learned. JAMA 305:19041905.
  • Gnirke A, Melnikov A, Maguire J, Rogov P, LeProust EM, Brockman W, Fennell T, Giannoukos G, Fisher S, Russ C, Gabriel S, Jaffe DB, Lander ES, Nusbaum C. 2009. Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing. Nat Biotechnol 27:1829.
  • Goecks J, Nekrutenko A, Taylor J. 2010. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol 11:R86.
  • Graffelman J. 2010. The number of markers in the HapMap project: some notes on chi-square and exact tests for Hardy–Weinberg equilibrium. Am J Hum Genet 86:8138; author reply 818–9.
  • Hoischen A, van Bon BW, Gilissen C, Arts P, van Lier B, Steehouwer M, de Vries P, de Reuver R, Wieskamp N, Mortier G, Devriendt K, Amorim MZ, and others. 2010. De novo mutations of SETBP1 cause Schinzel-Giedion syndrome. Nat Genet 42:483485.
  • Homer N, Nelson SF. 2010. Improved variant discovery through local re-alignment of short-read next-generation sequencing data using SRMA. Genome Biol 11:R99.
  • Johnston JJ, Teer JK, Cherukuri PF, Hansen NF, Loftus SK, Chong K, Mullikin JC, Biesecker LG. 2010. Massively parallel sequencing of exons on the X chromosome identifies RBM10 as the gene that causes a syndromic form of cleft palate. Am J Hum Genet 86:743–748.
  • Koboldt DC, Ding L, Mardis ER, Wilson RK. 2010. Challenges of sequencing human genomes. Brief Bioinform 11:484–498.
  • Landan G, Graur D. 2007. Heads or tails: a simple reliability check for multiple sequence alignments. Mol Biol Evol 24:1380–1383.
  • Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, and others. 2001. Initial sequencing and analysis of the human genome. Nature 409:860–921.
  • Ledergerber C, Dessimoz C. 2011. Base-calling for next-generation sequencing platforms. Brief Bioinform 12:489–497.
  • Li H, Durbin R. 2009. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25:1754–1760.
  • Li R, Li Y, Zheng H, Luo R, Zhu H, Li Q, Qian W, Ren Y, Tian G, Li J, Zhou G, Zhu X, and others. 2010. Building the sequence map of the human pan-genome. Nat Biotechnol 28:57–63.
  • McLaughlin HM, Sakaguchi R, Liu C, Igarashi T, Pehlivan D, Chu K, Iyer R, Cruz P, Cherukuri PF, Hansen NF, Mullikin JC; NISC Comparative Sequencing Program, Biesecker LG, and others. 2010. Compound heterozygosity for loss-of-function lysyl-tRNA synthetase mutations in a patient with peripheral neuropathy. Am J Hum Genet 87:560–566.
  • Meacham F, Boffelli D, Dhahbi J, Martin DI, Singer M, Pachter L. 2011. Identification and correction of systematic error in high-throughput sequence data. BMC Bioinformatics 12:451.
  • Ng PC, Henikoff S. 2003. SIFT: Predicting amino acid changes that affect protein function. Nucleic Acids Res 31:3812–3814.
  • Ng SB, Bigham AW, Buckingham KJ, Hannibal MC, McMillin MJ, Gildersleeve HI, Beck AE, Tabor HK, Cooper GM, Mefford HC, Lee C, Turner EH, and others. 2010a. Exome sequencing identifies MLL2 mutations as a cause of Kabuki syndrome. Nat Genet 42:790–793.
  • Ng SB, Buckingham KJ, Lee C, Bigham AW, Tabor HK, Dent KM, Huff CD, Shannon PT, Jabs EW, Nickerson DA, Shendure J, Bamshad MJ. 2010b. Exome sequencing identifies the cause of a mendelian disorder. Nat Genet 42:30–35.
  • Pierson TM, Adams D, Bonn F, Martinelli P, Cherukuri PF, Teer JK, Hansen NF, Cruz P, Mullikin JC; NISC Comparative Sequencing Program, Blakesley RW, Golas G, Kwan J, and others. 2011. Whole-exome sequencing identifies homozygous AFG3L2 mutations in a spastic ataxia-neuropathy syndrome linked to mitochondrial m-AAA proteases. PLoS Genet 7(10):e1002325.
  • Prickett TD, Wei X, Cardenas-Navia I, Teer JK, Lin JC, Walia V, Gartner J, Jiang J, Cherukuri PF, Molinolo A, Davies MA, Gershenwald JE, and others. 2011. Exon capture analysis of G protein-coupled receptors identifies activating mutations in GRM3 in melanoma. Nat Genet 43:1119–1126.
  • Roach JC, Glusman G, Smit AF, Huff CD, Hubley R, Shannon PT, Rowen L, Pant KP, Goodman N, Bamshad M, Shendure J, Drmanac R, Jorde LB, Hood L, Galas DJ. 2010. Analysis of genetic inheritance in a family quartet by whole-genome sequencing. Science 328:636–639.
  • Schatz MC, Delcher AL, Salzberg SL. 2010. Assembly of large genomes using second-generation sequencing. Genome Res 20:1165–1173.
  • Snyder M, Du J, Gerstein M. 2010. Personal genome sequencing: current approaches and challenges. Genes Dev 24:423–431.
  • Stenson PD, Ball EV, Mort M, Phillips AD, Shiel JA, Thomas NS, Abeysinghe S, Krawczak M, Cooper DN. 2003. Human Gene Mutation Database (HGMD): 2003 update. Hum Mutat 21:577–581.
  • Teer JK, Bonnycastle LL, Chines PS, Hansen NF, Aoyama N, Swift AJ, Abaan HO, Albert TJ, NISC Comparative Sequencing Program, Margulies EH, Green ED, Collins FS, Mullikin JC, Biesecker LG. 2010. Systematic comparison of three genomic enrichment methods for massively parallel DNA sequencing. Genome Res 20:1420–1431.
  • Teer JK, Green ED, Mullikin JC, Biesecker LG. 2012. VarSifter: Visualizing and analyzing exome-scale sequence variation data on a desktop computer. Bioinformatics 28:599–600.
  • van der Maarel SM, Tawil R, Tapscott SJ. 2011. Facioscapulohumeral muscular dystrophy and DUX4: breaking the silence. Trends Mol Med 17:252–258.
  • Wei X, Walia V, Lin JC, Teer JK, Prickett TD, Gartner J, Davis S; NISC Comparative Sequencing Program, Stemke-Hale K, Davies MA, Gershenwald JE, Robinson W, Robinson S, Rosenberg SA, Samuels Y. 2011. Exome sequencing identifies GRIN2A as frequently mutated in melanoma. Nat Genet 43:442–446.
  • Wigginton JE, Cutler DJ, Abecasis GR. 2005. A note on exact tests of Hardy–Weinberg equilibrium. Am J Hum Genet 76:887–893.

Supporting Information

Additional supporting information may be found in the online version of this article.

Filename                                      Size    Description
S4_Base-pair_exclusion_list.txt               488K    Supporting Information
S1_heterozygous_nonref.txt                    10K     Supporting Information
S2_heterozygous_nonref_2.txt                  465K    Supporting Information
S3_homozygous_nonref.txt                      24K     Supporting Information
Table_S1_empiric_final.txt                    3K      Supporting Information
Table_S2_ten_in_ten_families_final.txt        0K      Supporting Information
Table_S3_ten_in_twenty_families_final.txt     0K      Supporting Information
Table_S4_ten_in_all_families_final.txt        0K      Supporting Information
Table_S5_pseudogenes_final.txt                14K     Supporting Information
Table_S6_ten_in_more_than_3_of_27_final.txt   1K      Supporting Information
Table_S7_gene_exclusion_list_final.txt        17K     Supporting Information
humu_22033_sm_Table_S8.pdf                    105K    Supporting Information
Table_S9_final.xls                            5698K   Supporting Information
Table_S9_legends_final.pdf                    562K    Supporting Information
Supp_Mat-key_humu_22033_readme_R3.pdf         77K     Supporting Information
