Keywords:

  • exome;
  • next generation sequencing;
  • variant filtering;
  • Mendelian

Abstract

  1. Top of page
  2. Abstract
  3. Introduction
  4. Materials and Methods
  5. Results
  6. Discussion
  7. Acknowledgements
  8. References
  9. Supporting Information

The analysis of variants generated by exome sequencing (ES) of families with rare Mendelian diseases is a time-consuming, manual process that represents one barrier to applying the technology routinely. To address this issue, we have developed a software tool, VAR-MD (http://research.nhgri.nih.gov/software/var-md/), for analyzing the DNA sequence variants produced by human ES. VAR-MD generates a ranked list of variants using predicted pathogenicity, Mendelian inheritance models, genotype quality, and population variant frequency data. VAR-MD was tested using two previously solved data sets and one unsolved data set. In the solved cases, the correct variant was listed at the top of VAR-MD's variant ranking. In the unsolved case, the correct variant was highly ranked, allowing for subsequent identification and validation. We conclude that VAR-MD has the potential to enhance mutation identification using family-based, annotated next-generation sequencing data. Moreover, we predict an incremental advancement in software performance as the reference databases, such as Single Nucleotide Polymorphism Database and Human Gene Mutation Database, continue to improve. Hum Mutat 33:593–598, 2012. © 2012 Wiley Periodicals, Inc.


Introduction

A rare disease has a prevalence of fewer than 200,000 affected individuals in the United States [Office of Rare Diseases Research, 2011] or an incidence of less than 1 in 2,000 in Europe [EC_Regulation_on_Orphan_Medicinal_Products, 2011]. To date, approximately 7,000 distinct rare diseases have been described and affect 6–8% of the population creating a substantial health burden [Eurordis, 2005]. Many rare diseases are severe, early onset disorders without a cure or effective treatment [Schieppati et al., 2008]. Furthermore, limited experience with individual rare diseases may make them difficult to diagnose [Eurordis, 2005]. The consequences of diagnostic and therapeutic delays include emotional and economic burdens on families and on society as a whole.

Various nontraditional methods can improve the rate of diagnosis for rare diseases. Metabolomic and proteomic surveys of body fluids can shed light on the function of biochemical and signaling pathways. Clinical decision support systems rely on computational methods to correlate clinical phenotype and medical knowledge in order to identify a probable diagnosis of a known disease; accurate predictions, however, require the disease to be well characterized.

Genome sequencing (GS) and exome sequencing (ES) constitute a substantial advancement in the available diagnostic tools. Some observers predict that genomic methods hold the promise of diagnosing 80% of rare diseases with a genetic basis [Eurordis, 2005]. However, GS and ES analyses are not straightforward; 10^5 to 10^6 variants are detected per analyzed family, depending on technique. The multitude of variants poses a new challenge when seeking individual disease-causing changes in the genome; the goal is to identify a small subset of variants that are economically and technically feasible to test individually. Currently, ES and GS data are manipulated and analyzed by bioinformatics groups with substantial computational resources. Such resources are often not readily available to the individual investigator in either the scientific or medical community. The National Institutes of Health (NIH) Undiagnosed Diseases Program, which focuses on diagnosing rare, phenotypically unique disorders [Gahl et al., 2011; Gahl and Tifft, 2011], has been developing methodologies to allow individual investigators or clinicians to handle and analyze large data sets.

One initial result of this work is VAR-MD, a tool that analyzes a set of ES variants and generates a ranked list of potential disease-causing candidates (http://research.nhgri.nih.gov/software/var-md/). Recently, several heuristic search tools have been published for personal genome data [Pelak et al., 2010; Wang et al., 2010], and probabilistic tools that can be more broadly applicable to complex diseases are being developed as well [Yandell et al., 2011]. Using VAR-MD, we demonstrate effective candidate ranking in three small families with rare Mendelian diseases using a combination of heuristic reasoning, Mendelian inheritance filtering, and amino acid substitution-based quantitative scoring and ranking of variants. For each family, exome sequence had been obtained. Two of the three families had known disease-causing mutations that were identified by the manual application of analytical steps similar to those that have been incorporated into VAR-MD. The proband in the first family had AFG3L2 (MIM# 604581)-related spastic ataxia–neuropathy [Di Bella et al., 2010; Pierson et al., 2011a], and the proband in the second family had GM1 gangliosidosis (MIM# 230500). The third family did not have a pre-existing diagnosis, and VAR-MD was used to find the DNA sequence variants causing fatty acid hydroxylase-associated neurodegeneration (FHAN; MIM# 611026) [Dick et al., 2010; Kruer et al., 2010].

Materials and Methods

Human Subjects

Patients gave informed consent or assent when enrolled in the clinical protocol 76-HG-0238, “Diagnosis and Treatment of Patients with Inborn Errors of Metabolism and Other Genetic Disorders,” approved by the National Human Genome Research Institute (NHGRI) Review Board.

DNA Extraction

Genomic DNA was extracted from peripheral blood using the Gentra Puregene Blood kit (Qiagen, Valencia, CA) as per the manufacturer's instructions. For ES, an additional chloroform/phenol extraction step was performed to neutralize infectious agents.

Single-Nucleotide Polymorphism Array

The NHGRI Genomics Core laboratory performed single-nucleotide polymorphism (SNP) determinations using the Illumina Bead Array Platform (1MDuo and OmniQuad1M arrays; Illumina Inc., San Diego, CA). Genotype call rates were over 99.7% for all samples. Genome-wide fluorescent intensities and genotype calls, as well as the genotype-specific fluorescent intensities were analyzed using Bead Studio and Genome Studio (Illumina, Inc.). Copy number variations were identified using the PennCNV software [Wang et al., 2007] and then validated by visual inspection in Genome Studio (Illumina, Inc.).

The ENT software program [Gusev et al., 2008] and/or Genome Studio Boolean logic filters were used to determine sites of recombination within single-family pedigrees. Segregation analysis was used to identify regions of the genome that had been inherited in a manner consistent with genetic models being considered for each family.

Exome Sequencing

Exome sequencing was performed on probands and genetically informative immediate family members by the NIH Intramural Sequencing Center. Solution hybridization exome capture was carried out using the Sureselect Human All Exon System (Agilent Technologies Inc., Santa Clara, CA). The manufacturer's protocol version 1.0, compatible with Illumina paired end sequencing, was used. The captured regions totaled approximately 38 or 50 Mb depending on the kit used. Flow cell preparation and 76-bp paired end read sequencing were carried out as per protocol for the GAIIx sequencer (Illumina Inc.). Approximately two lanes on a GAIIx flow cell were used per exome sample to generate sufficient reads to produce the coverage needed for high quality aligned sequence.

Sequence Alignment and Annotation

Reads were initially aligned using Efficient Large-Scale Alignment of Nucleotide Databases (ELAND; Illumina Inc.). ELAND alignments were used to place reads in bins of about five million base pairs. Unmapped reads were placed in the bin of their mate if the mate was mapped. Cross_match (P. Green, http://www.phrap.org) was used to align the reads assigned to each bin to the corresponding ∼5 Mb of genomic sequence. Cross_match alignments were converted to the SAMtools BAM format, and then genotypes were called using the Bayesian genotype-assigning algorithm bam2mpg [Teer et al., 2010]. The genotype-calling (bam2mpg) step annotates each genotype call with a log-based quality score that is used in the subsequent analytical and filtering processes.
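
The binning step can be illustrated with a minimal Python sketch. The bin width of 5,000,000 bp follows the approximate figure given above; the `Read` record and the function names are hypothetical and are not taken from ELAND, cross_match, or bam2mpg.

```python
from collections import defaultdict
from typing import NamedTuple, Optional

BIN_SIZE = 5_000_000  # approximate bin width stated in the text


class Read(NamedTuple):
    name: str
    chrom: Optional[str]  # None when the read is unmapped
    pos: Optional[int]    # 1-based start position; None when unmapped


def bin_key(read):
    """Return (chromosome, bin index) for a mapped read, otherwise None."""
    if read.chrom is None or read.pos is None:
        return None
    return (read.chrom, (read.pos - 1) // BIN_SIZE)


def assign_pairs_to_bins(pairs):
    """Bin both reads of each pair; an unmapped read follows its mapped mate."""
    bins = defaultdict(list)
    for first, second in pairs:
        key1, key2 = bin_key(first), bin_key(second)
        if key1 is None and key2 is None:
            continue  # neither mate is mapped, so no bin can be chosen
        bins[key1 if key1 is not None else key2].append(first.name)
        bins[key2 if key2 is not None else key1].append(second.name)
    return dict(bins)


# Hypothetical read pairs: the second pair has an unmapped mate.
pairs = [
    (Read("r1/1", "chr5", 59_604_765), Read("r1/2", "chr5", 59_604_900)),
    (Read("r2/1", "chr18", 2_144_927), Read("r2/2", None, None)),
]
binned = assign_pairs_to_bins(pairs)
```

In this sketch the unmapped read `r2/2` lands in the same bin as its mate, so a later, more sensitive aligner still sees it alongside the relevant stretch of reference sequence.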

Pathogenicity Annotation

The pathogenicity of each variant was based on its evolutionary sequence conservation using the CDPred tool [Johnston et al., 2010]. CDPred assigns a numeric score to each variation that can be aligned to a residue in the National Center for Biotechnology Information Conserved Domain database [Marchler-Bauer et al., 2011]. Deleteriousness scores for unaligned bases were assigned using the BLOSUM62 scoring matrix.

Sanger Validation of Variants

We designed primers flanking the sequence variants of interest with the Primer3 program (http://frodo.wi.mit.edu/primer3/) (Table 1). With these primers, we performed polymerase chain reaction (PCR) amplification using HotStar Taq (Qiagen), with an initial denaturation step at 95°C for 1 min, followed by 35 cycles of denaturation at 95°C for 30 sec, annealing at 60°C for 30 sec, and extension at 72°C for 1 min; final extension time was 7 min at 72°C. After purification with ExoSAP-IT, the PCR products were sequenced with dye-terminator chemistry (Applied Biosystems, Foster City, CA), and the sequences were aligned using the Sequencher version 4.10.1 software (Gene Codes, Ann Arbor, MI). The mutations are reported according to standard nomenclature (http://www.hgvs.org/mutnomen/) [den Dunnen and Antonarakis, 2000]. Using sequence NM_006796.1 as a reference for AFG3L2, NM_000404.2 for GLB1, and NM_024306.4 for FA2H, we numbered the complementary DNA (cDNA) sequences such that the adenine of the ATG translation initiation codon is +1. Those mutations that are numbered based on the cDNA sequence are preceded by “c.” The protein reference sequence was derived from the open reading frame in NM_006796.1, NG_009005.1, or NG_017070.1 and those mutations that are numbered based on the amino acid sequence are preceded by “p.”

Table 1. Oligonucleotides Used for PCR and DNA Sequencing

Gene | Mutation | Forward primer/Reverse primer
GLB1 | p.R201H | AGCTTGCATTAGGGTGGCTATCTCAATCTGCCCATGACAC
GLB1 | p.G262V | CTTGGGTGTTAAGTTCCAACATGACTCCACAATCCCATTAGC
FA2H | p.F236S | CCCGTGAGTCACATCAAACTTTTGACTCAAGGACCCCAG

Analysis of Exome Variants

Variants that were detected using ES were analyzed using VAR-MD. VAR-MD is a Unix-based tool implemented in Python (www.python.org). VAR-MD has many independent functions that usually run sequentially. Some of these functions are rapid and others are slower. For example, a VAR-MD pipeline that employs annotation steps with heavy database usage or generates a compound heterozygote model filter can require a total of approximately 4–6 hr to run for a typical exome data set on a desktop computer. To facilitate the analysis of multiple data sets in parallel, VAR-MD utilizes Galaxy [Blankenberg et al., 2010; Giardine et al., 2005], which in turn uses the distributed resource management application API to facilitate various distributed resource management systems (http://drmaa.org/). VAR-MD supports command line execution in Unix-like environments.

Results

Development of VAR-MD

The input for VAR-MD is a list of variants generated by ES and formatted as shown in Supp. File S1. The minimum required columns include chromosome, nucleotide position, gene name, transcript name, amino acid change and position, genotype of each sample, and depth of coverage. For the best results in filtering and ranking, we recommend providing all columns shown in the example. VAR-MD requires the family structure, sex, and affected status of individuals in a separate text file (Supp. File S2).

Additional data that are useful for ranking the sequence variants include a genotype quality score, a numerical pathogenicity prediction, and the population frequency of the variant. Genotype quality scores for variants can be provided by tools such as MPG [Teer et al., 2010] or SAMtools [Li et al., 2009]. Tools for predicting the likelihood of pathogenicity/deleteriousness of mutations that change amino acids include CDPred [Johnston et al., 2010], PolyPhen II [Adzhubei et al., 2010], and others [Ng and Henikoff, 2006]. The population frequencies of sequence variants are derived from the Single Nucleotide Polymorphism Database (dbSNP), local genotype data, and the 1000 Genomes Project; VAR-MD requires the full paths to these files, as downloaded from the UCSC Genome Browser database [Fujita et al., 2011], and is configured to use these files.

Analysis using VAR-MD can also be focused on specific genomic regions, genes, or gene lists. If certain regions of the genome are of interest because of prior data of linkage, homozygosity, hemizygosity, or clinical relevance, the genomic positions of these regions can be supplied in a bed format (UCSC Genome Browser). VAR-MD will annotate the final output accordingly and flag the variants that fall into the region of interest in the final ranked list. VAR-MD uses regions of interest to modify how candidate variants are ranked, but it does not exclude variants that are outside regions of interest. This approach keeps all variants in the list so that the analyst can access variants both inside and outside the candidate regions.
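
The region-of-interest behavior described above (flagging rather than excluding) can be sketched in a few lines of Python. This is an illustration only: the function names and the `in_roi` field are hypothetical and do not reflect VAR-MD's actual code. The sketch assumes the standard BED convention of 0-based, half-open intervals and 1-based variant positions; the two regions are the hg18 homozygous intervals quoted in the Results, and the variant positions are invented.

```python
def parse_bed(lines):
    """Parse BED text lines into (chrom, start, end) tuples (0-based, half-open)."""
    regions = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines, comments, and header lines
        chrom, start, end = line.split()[:3]
        regions.append((chrom, int(start), int(end)))
    return regions


def flag_variants(variants, regions):
    """Annotate every variant with an in_roi flag; no variant is dropped."""
    flagged = []
    for variant in variants:
        # A 1-based position p lies in a BED interval when start < p <= end.
        inside = any(
            chrom == variant["chrom"] and start < variant["pos"] <= end
            for chrom, start, end in regions
        )
        flagged.append({**variant, "in_roi": inside})
    return flagged


bed_lines = [
    "chr5\t59604764\t94462291",
    "chr18\t2144926\t18081471",
]
# Hypothetical variants used only to exercise the flagging logic.
variants = [
    {"id": "v1", "chrom": "chr5", "pos": 59_604_765},
    {"id": "v2", "chrom": "chr5", "pos": 59_604_764},
    {"id": "v3", "chrom": "chr1", "pos": 1_000_000},
]
flagged = flag_variants(variants, parse_bed(bed_lines))
```

Because the flag is only an annotation, a downstream ranking step can favor flagged variants while the analyst still sees the candidates outside the regions.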

VAR-MD implements a stepwise filtering algorithm to exclude variants identified as having a low potential to be disease-causing genotypes and/or a high potential to be false-positive genotypes (Fig. 1). The tool first selects variants meeting a specified predicted deleteriousness of a mutation and then requires segregation according to Mendelian inheritance models. Subsequently, gene and linkage exclusion filters are imposed on the remaining variants. These steps form the basis of the filters that eliminate variants from further consideration.
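
As a rough illustration of the first two filtering steps, the following Python sketch applies a deleteriousness cutoff and then a homozygous recessive segregation check. It is not VAR-MD's implementation. It assumes that scores below the cutoff denote predicted deleterious changes, that genotypes are coded as the number of alternate alleles (0, 1, or 2), and that both parents of an affected child are unaffected carriers; all names and example values are hypothetical.

```python
def is_predicted_deleterious(variant, cutoff=0.0):
    """Keep variants whose pathogenicity score falls below the cutoff (assumed direction)."""
    return variant["score"] < cutoff


def fits_homozygous_recessive(genotypes, pedigree):
    """Check one variant against a simple homozygous recessive model."""
    for person, info in pedigree.items():
        alt_count = genotypes[person]
        if info["affected"] and alt_count != 2:
            return False  # every affected member must be homozygous alternate
        if not info["affected"] and alt_count == 2:
            return False  # no unaffected member may be homozygous alternate
        if info["parent_of_affected"] and alt_count != 1:
            return False  # parents are assumed to be heterozygous carriers
    return True


def stepwise_filter(variants, pedigree, cutoff=0.0):
    """Apply the deleteriousness filter first, then the segregation filter."""
    deleterious = [v for v in variants if is_predicted_deleterious(v, cutoff)]
    return [v for v in deleterious
            if fits_homozygous_recessive(v["genotypes"], pedigree)]


pedigree = {
    "father": {"affected": False, "parent_of_affected": True},
    "mother": {"affected": False, "parent_of_affected": True},
    "sib1": {"affected": True, "parent_of_affected": False},
    "sib2": {"affected": True, "parent_of_affected": False},
}
carrier_parents = {"father": 1, "mother": 1, "sib1": 2, "sib2": 2}
variants = [
    {"id": "v1", "score": -3.0, "genotypes": carrier_parents},
    {"id": "v2", "score": 2.0, "genotypes": carrier_parents},
    {"id": "v3", "score": -1.0,
     "genotypes": {"father": 1, "mother": 0, "sib1": 2, "sib2": 1}},
]
kept = stepwise_filter(variants, pedigree)
```

Here `v2` is removed by the score cutoff and `v3` by the segregation check, leaving only `v1`. Other inheritance models would swap in a different segregation function while keeping the same two-stage structure.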

Figure 1. The VAR-MD algorithm.

For the remaining variants, subsequent analysis prioritizes them according to a calculated variant score based on genotype quality scores, coverage values, ratio of genotype call score to coverage, pathogenicity scores, and population frequency to give an ordered list from which the investigator can consider individual variants for further work.
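
The text lists the inputs to the variant score but not the formula, so the following Python sketch is only one plausible way to combine them. The equal weighting, the caps of 100 and 10, and all field names are assumptions made for illustration, as is the convention that more negative pathogenicity scores are more deleterious.

```python
def variant_score(variant):
    """Combine quality, pathogenicity, and rarity into one hypothetical score in [0, 4]."""
    depth = variant["coverage"]
    # Ratio of genotype call score to coverage; guarded against zero depth.
    ratio = variant["genotype_quality"] / depth if depth > 0 else 0.0
    quality_term = min(variant["genotype_quality"], 100) / 100
    ratio_term = min(ratio, 1.0)
    # Assumed convention: more negative pathogenicity scores are more deleterious.
    pathogenicity_term = min(max(-variant["pathogenicity"], 0.0), 10.0) / 10.0
    rarity_term = 1.0 - variant["population_frequency"]
    return quality_term + ratio_term + pathogenicity_term + rarity_term


def rank_variants(variants):
    """Return variants ordered from highest to lowest score."""
    return sorted(variants, key=variant_score, reverse=True)


# Hypothetical candidates that survived the filtering steps.
candidates = [
    {"id": "C", "genotype_quality": 20, "coverage": 80,
     "pathogenicity": -2.0, "population_frequency": 0.0},
    {"id": "B", "genotype_quality": 90, "coverage": 60,
     "pathogenicity": -8.0, "population_frequency": 0.2},
    {"id": "A", "genotype_quality": 90, "coverage": 60,
     "pathogenicity": -8.0, "population_frequency": 0.0},
]
ranked = rank_variants(candidates)
```

With these inputs the well-supported, rare, strongly deleterious variant `A` sorts first, the otherwise identical but more common `B` second, and the poorly supported `C` last.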

Implementation and Validation of VAR-MD

To test if VAR-MD could parse sequence variants into a tractable list, we applied the above logic to three families, whose pedigrees are shown in Figure 2. Family 1 included two brothers (whose parents were first cousins) with the onset of spasticity, ataxia, dysarthria, dysphagia, and myoclonic epilepsy in late childhood [Pierson et al., 2011a]. Family 2 involved a 19-year-old girl with progressive loss of cognitive and motor function over the past 12 years. Extensive biochemical and genetic testing, muscle histopathology and respiratory chain biochemistry, and a leukocyte lysosomal enzyme screen did not detect the basis of her disease. Family 3 included a 12-year-old boy who developed limb spasticity, impaired eye movements, and scanning speech at age 3. By 10 years, he needed a walker and continued to regress. Extensive biochemical and genetic testing also failed to identify the cause of his disease. The steps in filtering and ranking, with corresponding reduction in numbers of candidate variants, are summarized for each family in Table 2.

Figure 2. Pedigrees of Families 1, 2, and 3. Affected individuals are depicted by black symbols and unaffected individuals by open symbols.

Table 2. Number of Variants After Each Filter and Ranks Based on Variant

Family | Total number of detected variants | Predicted deleterious | Homozygous recessive | Dominant/de novo | X-linked recessive | Compound heterozygous pairs | Rank in sorted list
1 | 120,479 | 7,008 | 44a | 12715 | 2b
2 | 135,293 | 8,550 | 133 | 32 | 0 | 55a | 1
3 | 111,979 | 7,490 | 120 (3c) | 532104 | 1

a Presumed inheritance model.
b The other mutation that ranked higher than the pathogenic mutation was DMGDH, which is associated with dimethylglycine dehydrogenase deficiency. The clinical characteristics of the disease did not fit UDP369.
c Presumed homozygous recessive inheritance gave 120 variants segregating according to the model but only three variants were in the region of hemizygosity.

For all three families, VAR-MD was run with the same default options (i.e., CDPred deleteriousness cutoff of <0 to filter out predicted nondeleterious mutations) and configuration (Supp. File S3). VAR-MD's analysis uses a generic approach that branches and converges at steps (e.g., Mendelian filters) during the analysis to allow filtering and ranking for multiple genetic models during a single run. A sample output of VAR-MD using a homozygous recessive Mendelian model can be found in Supp. File S4.

For Family 1, the parents were first cousins and disease recurrence suggested autosomal recessive inheritance. Consistent with this, analysis of Illumina SNP arrays run on the family members identified regions of homozygosity shared between the two affected siblings (e.g., hg18 chr5:59,604,765-94,462,291; chr18:2,144,927-18,081,471). Imposing the pathogenicity cutoff and a homozygous recessive inheritance model, VAR-MD prioritized 44 variants, a 99.96% reduction from the initial variant list. Restricting the analysis to variants within homozygous regions shared between the two affected siblings reduced the list to 12 variants. The second ranked variant in the list of 12 was a homozygous mutation (NM_006796.1:c.[1847A>G];[1847A>G], p.[Y616C];[Y616C]) in AFG3L2 [Pierson et al., 2011a]. Biallelic mutations of AFG3L2 cause SCA28 and variants thereof consistent with the patients' presentation.
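
The reduction quoted for Family 1 can be checked with a short calculation. The counts (120,479 detected variants, 44 after filtering, 12 within shared homozygous regions) come from the text and Table 2; the function name is ours.

```python
def percent_reduction(initial, remaining):
    """Percentage of the initial variant list removed by filtering."""
    return 100.0 * (1 - remaining / initial)


family1_total = 120_479
after_model_filter = percent_reduction(family1_total, 44)
after_region_filter = percent_reduction(family1_total, 12)
print(f"{after_model_filter:.2f}% {after_region_filter:.2f}%")
```

Restricting to the shared homozygous regions therefore removes roughly 99.99% of the variants originally detected.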

For Family 2, the affected child was a female, the parents were unrelated and there was no recurrence of disease in the family. This suggested an autosomal recessive or de novo dominant inheritance model. Choosing the default pathogenicity cutoff, VAR-MD identified 188 and 32 sequence variants meeting each of these inheritance models, respectively. VAR-MD then prioritized each list according to genotype quality score, predicted pathogenicity, and the population frequencies from multiple sources. Two variants (NM_000404.2:c.[785G>T];[602G>A], p.[G262V];[R201H]) [Pierson et al., 2011b] were identified in GLB1 (MIM# 611458). Biallelic mutations of GLB1 cause GM1 gangliosidosis, consistent with the patient's clinical presentation. Subsequent repeat clinical biochemical and molecular testing confirmed this diagnosis.

For Family 3, the affected child was a male, the parents were unrelated, and there was no recurrence in the family. This suggested an autosomal recessive, X-linked recessive, or de novo dominant inheritance model. An accompanying SNP chip analysis identified the presence of a deletion (NM_024306.4:c.363-?_1119+?del) inherited from a parent. With inclusion of this hemizygosity, the current version of VAR-MD ranked the causative variant as the top candidate, whereas previous versions that did not take hemizygosity into account did not prioritize the causative variant because its inheritance did not follow the required Mendelian pattern. The variant (NM_024306.4:c.707T>C, p.F236S) was in the FA2H gene. Biallelic mutations of FA2H (MIM# 611026) cause FHAN, a phenotype consistent with that of UDP369 [Pierson et al., 2011c].

Discussion

VAR-MD is a software tool developed to prioritize sequence variants from ES. It generates a filtered, ranked list of candidates for inspection and functional validation. Although many aspects of variant filtering and ranking used by VAR-MD have been used previously, VAR-MD is the only currently available open source tool for combining the analyses in a customizable and automated process. Important VAR-MD features include the ability to allow a flexible set of variants (e.g., missense plus deletion) to identify a candidate, and the ability to synthesize data from multiple members of a small family. The latter capability is important because there are currently no free, nonproprietary tools available for matching variant pairs to identify candidate loci consistent with compound heterozygous or hemizygous inheritance. An additional strength of VAR-MD is that it provides a means to iterate variant analyses over a set of user-specified filtering and ranking criteria. For each iteration, the analyst can vary candidate regions, deleteriousness prediction thresholds, Mendelian filtering stringency, and other parameters as needed for individual projects.
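
The idea of matching variant pairs under a compound heterozygous model can be sketched as follows. This is a simplified illustration for a trio (one affected child and both parents), not VAR-MD's algorithm: genotypes are coded as alternate allele counts, each variant of a pair must be heterozygous in the child and inherited from a different parent, and hemizygous inheritance is not covered. Gene and variant identifiers are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def compound_het_pairs(variants, child, father, mother):
    """Return (gene, id1, id2) for pairs with one paternal and one maternal variant."""
    by_gene = defaultdict(list)
    for variant in variants:
        genotypes = variant["genotypes"]
        if genotypes[child] != 1:
            continue  # the child must be heterozygous at each site of the pair
        if genotypes[father] == 1 and genotypes[mother] == 0:
            by_gene[variant["gene"]].append((variant["id"], "paternal"))
        elif genotypes[father] == 0 and genotypes[mother] == 1:
            by_gene[variant["gene"]].append((variant["id"], "maternal"))
    pairs = []
    for gene, hits in by_gene.items():
        for (id1, origin1), (id2, origin2) in combinations(hits, 2):
            if origin1 != origin2:  # one allele from each parent
                pairs.append((gene, id1, id2))
    return pairs


def trio(child, father, mother):
    """Small helper to build a genotype dictionary for the example."""
    return {"child": child, "father": father, "mother": mother}


variants = [
    {"id": "v1", "gene": "GENE1", "genotypes": trio(1, 1, 0)},
    {"id": "v2", "gene": "GENE1", "genotypes": trio(1, 0, 1)},
    {"id": "v3", "gene": "GENE2", "genotypes": trio(1, 1, 0)},
    {"id": "v4", "gene": "GENE2", "genotypes": trio(1, 1, 0)},
    {"id": "v5", "gene": "GENE3", "genotypes": trio(1, 0, 0)},
]
pairs = compound_het_pairs(variants, "child", "father", "mother")
```

Only `GENE1` yields a pair here: both `GENE2` variants come from the father, and the `GENE3` variant is absent from both parents. Pairing every eligible variant within a gene is quadratic in the number of hits per gene, which is one reason such a filter can be slow on a full exome.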

Individual VAR-MD processing steps have been designed to include the highest quality subsets of available data. Previous ES analyses removed all variants found in dbSNP [Ng et al., 2010]. In contrast, VAR-MD uses a curated set of dbSNP and 1000 Genomes Project variants for which allele frequencies have been reported [Altshuler et al., 2010]. VAR-MD also uses local sequence data to determine genotype frequencies and genotypes segregating in families.

Analysis of ES data requires an assessment of the quality of genotyping, including sequencing and alignment artifacts. VAR-MD can incorporate quality measures in its filtering and ranking steps. An example of such quality data is coverage, or the number of sequencing reads that have been assembled at a specified genomic site. In many cases, increased read depth equates with an increasingly accurate genotype call [Bentley et al., 2008].

VAR-MD can incorporate pedigree data into its variant filtering steps. Having sequence data from a family with parents and children (affected and unaffected) allows for the construction of powerful false-positive filters. Filtering is based in part on the fact that there are only approximately 175 new mutations per diploid genome per generation. Consequently, in coding regions, there are on average only two to three de novo nonsynonymous mutations per diploid genome per generation [Nachman and Crowell, 2000]. To circumvent elimination of true positive de novo variants, VAR-MD's filtering retains as viable candidates sequence variants that occur only in the affected individuals and not in the unaffected family members. These variants are reported in a separate dominant variant output list. In contrast, variants that do not segregate correctly (e.g., variants found in affected and unaffected members) are filtered out. For variants that have poor genotype quality scores in half of the family, results are reported but are listed further down the ranked list; this gives the reader a chance to consider them after examining the more reliably sequenced results.
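
The segregation rule for dominant and de novo candidates described above can be expressed as a small Python sketch. It assumes genotypes coded as alternate allele counts and is meant only to illustrate the rule, not to reproduce VAR-MD's code; all identifiers are hypothetical.

```python
def dominant_candidates(variants, affected, unaffected):
    """Keep variants carried by every affected member and by no unaffected member."""
    kept = []
    for variant in variants:
        genotypes = variant["genotypes"]
        in_all_affected = all(genotypes[person] >= 1 for person in affected)
        in_no_unaffected = all(genotypes[person] == 0 for person in unaffected)
        if in_all_affected and in_no_unaffected:
            kept.append(variant)
    return kept


# Hypothetical trio with one affected child and two unaffected parents.
variants = [
    {"id": "v1", "genotypes": {"child": 1, "father": 0, "mother": 0}},
    {"id": "v2", "genotypes": {"child": 1, "father": 1, "mother": 0}},
    {"id": "v3", "genotypes": {"child": 0, "father": 0, "mother": 0}},
]
dominant_list = dominant_candidates(variants, ["child"], ["father", "mother"])
```

Only `v1` survives: `v2` is also present in an unaffected parent and `v3` is absent from the affected child. Given how few true de novo coding changes are expected per generation, most variants passing this rule in real data are likely genotyping artifacts, which is why quality-based ranking of the resulting list matters.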

VAR-MD is currently designed to work with small simple pedigrees and a defined group of genetic models. It will not perform as expected in situations where there is genetic heterogeneity or incomplete phenotyping. It will not incorporate data from half-siblings and other “nonnuclear” pedigree members. VAR-MD uses publicly available variant annotations (e.g., http://snp.gs.washington.edu/SeattleSeqAnnotation/index.jsp, http://www.ncbi.nlm.nih.gov/projects/SNP/) and may perform suboptimally when such information is lacking. Additional features and capabilities will be added to VAR-MD as resources permit.

Defining the phenotypic consequences of specific DNA variants is complex and often requires biological laboratory work to follow-up in silico analyses. In many cases, the exact nature of what constitutes a pathological variant awaits better delineation of what really is a healthy genome [Moore et al., 2011]. Parsing the entire human genome into beneficial, neutral, and pathogenic alleles, and defining all the interactions among alleles is still a daunting task for human genomics. VAR-MD is a powerful tool for the efficient application of variant filters and annotations to create a working list of pathogenesis candidates.

Diagnosing rare diseases remains a major problem for clinicians and researchers alike. The advent of affordable and practical GS opens many possibilities to address this issue. Tools such as VAR-MD facilitate diagnosis by improving the speed and accuracy of ES data analysis through an automated, flexible, and reliable platform.

Acknowledgements

The authors thank the NIH UDP patients, their families, and their physicians for making this enterprise a truly united effort. We thank Roxanne Fischer for her excellent technical assistance and the entire UDP staff for their dedicated service. The NHGRI Genomics Core provided superb SNP array results. We thank the NIH Intramural Sequencing Center for performing the whole-exome and whole-genome sequencing and analysis.

Disclosure Statement: The authors declare no conflict of interest.

References

  • Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, Bork P, Kondrashov AS, Sunyaev SR. 2010. A method and server for predicting damaging missense mutations. Nat Methods 7:248–249.
  • Altshuler DL, Durbin RM, Abecasis GR, Bentley DR, Chakravarti A, Clark AG, Collins FS, De La Vega FM, Donnelly P, Egholm M, Flicek P, Gabriel SB, Gibbs RA, Knoppers BM, Lander ES, and many others. 2010. A map of human genome variation from population-scale sequencing. Nature 467:1061–1073.
  • Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, Hall KP, Evers DJ, Barnes CL, Bignell HR, Boutell JM, Bryant J, Carter RJ, Keira Cheetham R, Cox AJ, and many others. 2008. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456:53–59.
  • Blankenberg D, Von Kuster G, Coraor N, Ananda G, Lazarus R, Mangan M, Nekrutenko A, Taylor J. 2010. Galaxy: A web-based genome analysis tool for experimentalists. Curr Protoc Mol Biol 19:1–21.
  • den Dunnen JT, Antonarakis SE. 2000. Mutation nomenclature extensions and suggestions to describe complex mutations: A discussion. Hum Mutat 15:7–12.
  • Di Bella D, Lazzaro F, Brusco A, Plumari M, Battaglia G, Pastore A, Finardi A, Cagnoli C, Tempia F, Frontali M, Veneziano L, Sacco T, Boda E, Brussino A, Bonn F, Castellotti B, Baratta S, Mariotti C, Gellera C, Fracasso V, Magri S, Langer T, Plevani P, Di Donato S, Muzi-Falconi M, Taroni F. 2010. Mutations in the mitochondrial protease gene AFG3L2 cause dominant hereditary ataxia SCA28. Nat Genet 42:313–321.
  • Dick KJ, Eckhardt M, Paisan-Ruiz C, Alshehhi AA, Proukakis C, Sibtain NA, Maier H, Sharifi R, Patton MA, Bashir W, Koul R, Raeburn S, Gieselmann V, Houlden H, Crosby AH. 2010. Mutation of FA2H underlies a complicated form of hereditary spastic paraplegia (SPG35). Hum Mutat 31:E1251–E1260.
  • EC_Regulation_on_Orphan_Medicinal_Products. 2011. EC Regulation on Orphan Medicinal Products.
  • Eurordis. 2005. Rare diseases: Understanding this public health priority.
  • Fujita PA, Rhead B, Zweig AS, Hinrichs AS, Karolchik D, Cline MS, Goldman M, Barber GP, Clawson H, Coelho A, Diekhans M, Dreszer TR, Giardine BM, Harte RA, Hillman-Jackson J, Hsu F, Kirkup V, Kuhn RM, Learned K, Li CH, Meyer LR, Pohl A, Raney BJ, Rosenbloom KR, Smith KE, Haussler D, Kent WJ. 2011. The UCSC Genome Browser database: Update 2011. Nucleic Acids Res 39:D876–D882.
  • Gahl WA, Markello TC, Toro C, Fajardo KF, Sincan M, Gill F, Carlson-Donohoe H, Gropman A, Pierson TM, Golas G, Wolfe L, Groden C, Godfrey R, Nehrebecky M, Wahl C, Landis DM, Yang S, Madeo A, Mullikin JC, Boerkoel CF, Tifft CJ, Adams D. 2011. The National Institutes of Health Undiagnosed Diseases Program: Insights into rare diseases. Genet Med 14:51–59.
  • Gahl WA, Tifft CJ. 2011. The NIH Undiagnosed Diseases Program: Lessons learned. JAMA 305:1904–1905.
  • Giardine B, Riemer C, Hardison RC, Burhans R, Elnitski L, Shah P, Zhang Y, Blankenberg D, Albert I, Taylor J, Miller W, Kent WJ, Nekrutenko A. 2005. Galaxy: A platform for interactive large-scale genome analysis. Genome Res 15:1451–1455.
  • Gusev A, Mandoiu II, Pasaniuc B. 2008. Highly scalable genotype phasing by entropy minimization. IEEE/ACM Trans Comput Biol Bioinform 5:252–261.
  • Johnston JJ, Teer JK, Cherukuri PF, Hansen NF, Loftus SK, NIH Intramural Sequencing Center (NISC), Chong K, Mullikin JC, Biesecker LG. 2010. Massively parallel sequencing of exons on the X chromosome identifies RBM10 as the gene that causes a syndromic form of cleft palate. Am J Hum Genet 86:743–748.
  • Kruer MC, Paisan-Ruiz C, Boddaert N, Yoon MY, Hama H, Gregory A, Malandrini A, Woltjer RL, Munnich A, Gobin S, Polster BJ, Palmeri S, Edvardson S, Hardy J, Houlden H, Hayflick SJ. 2010. Defective FA2H leads to a novel form of neurodegeneration with brain iron accumulation (NBIA). Ann Neurol 68:611–618.
  • Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R. 2009. The sequence alignment/map format and SAMtools. Bioinformatics 25:2078–2079.
  • Marchler-Bauer A, Lu SN, Anderson JB, Chitsaz F, Derbyshire MK, DeWeese-Scott C, Fong JH, Geer LY, Geer RC, Gonzales NR, Gwadz M, Hurwitz DI, Jackson JD, Ke ZX, Lanczycki CJ, Lu F, Marchler GH, Mullokandov M, Omelchenko MV, Robertson CL, Song JS, Thanki N, Yamashita RA, Zhang DC, Zhang NG, Zheng CJ, Bryant SH. 2011. CDD: A conserved domain database for the functional annotation of proteins. Nucleic Acids Res 39:D225–D229.
  • Moore B, Hu H, Singleton M, Reese MG, De La Vega FM, Yandell M. 2011. Global analysis of disease-related DNA sequence variation in 10 healthy individuals: Implications for whole genome-based clinical diagnostics. Genet Med 13:210–217.
  • Nachman MW, Crowell SL. 2000. Estimate of the mutation rate per nucleotide in humans. Genetics 156:297–304.
  • Ng PC, Henikoff S. 2006. Predicting the effects of amino acid substitutions on protein function. Annu Rev Genomics Hum Genet 7:61–80.
  • Ng SB, Buckingham KJ, Lee C, Bigham AW, Tabor HK, Dent KM, Huff CD, Shannon PT, Jabs EW, Nickerson DA et al. 2010. Exome sequencing identifies the cause of a mendelian disorder. Nat Genet 42:30–35.
  • Office of Rare Diseases Research NIoH. 2011. Rare diseases and related terms.
  • Pelak K, Shianna KV, Ge DL, Maia JM, Zhu MF, Smith JP, Cirulli ET, Fellay J, Dickson SP, Gumbs CE, Heinzen EL, Need AC, Ruzzo EK, Singh A, Campbell CR, Hong LK, Lornsen KA, McKenzie AM, Sobreira NLM, Hoover-Fong JE, Milner JD, Ottman R, Haynes BF, Goedert JJ, Goldstein DB. 2010. The characterization of twenty sequenced human genomes. PLoS Genet 6:e1001111.
  • Pierson TM, Adams D, Bonn F, Martinelli P, Cherukuri PF, Teer JK, Hansen NF, Cruz P, Mullikin J, Blakesley RW, Golas G, Kwan J, Sandler A, Fuentes Fajardo K, Markello T, Tifft C, Blackstone C, Rugarli EI, Langer T, Gahl WA, Toro C. 2011a. Whole-exome sequencing identifies homozygous AFG3L2 mutations in a novel spastic ataxia-neuropathy syndrome linked to mitochondrial m-AAA proteases. PLoS Genet 7:e1002325.
  • Pierson TM, Adams D, Markello T, Golas G, Yang S, Sincan M, Simeonov DR, Fuentes Fajardo K, Hansen N, Cherukuri PF, Cruz P, Teer J, Mullikin JC, Boerkoel CF, Gahl WA, Tifft C. 2011b. Exome sequencing as a diagnostic tool in a case of undiagnosed juvenile-onset GM1-gangliosidosis. Neurology
  • Pierson TM, Simeonov DR, Sincan M, Adams D, Markello T, Golas G, Hansen N, Cherukuri PF, Cruz P, Mullikin JC, Blackstone C, Tifft C, Boerkoel CF, Gahl WA. 2011c. Exome sequencing and SNP analysis detects novel compound heterozygosity in fatty acid hydroxylase-associated neurodegeneration. Eur J Hum Genet
  • Schieppati A, Henter J-I, Daina E, Aperia A. 2008. Why rare diseases are an important medical and social issue. The Lancet 371:2039–2041.
  • Teer JK, Bonnycastle LL, Chines PS, Hansen NF, Aoyama N, Swift AJ, Abaan HO, Albert TJ, Margulies EH, Green ED, Collins FS, Mullikin JC, Biesecker LG. 2010. Systematic comparison of three genomic enrichment methods for massively parallel DNA sequencing. Genome Res 20:1420–1431.
  • UCSC Genome Bioinformatics Group. UCSC Genome Browser. Bed File Format. Center for Biomolecular Science & Engineering: University of California, Santa Cruz, CA.
  • Wang K, Li M, Hadley D, Liu R, Glessner J, Grant SF, Hakonarson H, Bucan M. 2007. PennCNV: An integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res 17:1665–1674.
  • Wang K, Li MY, Hakonarson H. 2010. ANNOVAR: Functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res 38:e164.
  • Yandell M, Huff C, Hu H, Singleton M, Moore B, Xing J, Jorde LB, Reese MG. 2011. A probabilistic disease-gene finder for personal genomes. Genome Res 21:1529–1542.

Supporting Information

Additional supporting information may be found in the online version of this article.

Filename | Size | Description
humu_22034_sm_SuppInfo.pdf | 41K | Supporting Information
humu_22034_sm_SuppInfo2.txt | 9663K | Supporting Information
humu_22034_sm_SuppInfo3.txt | 1K | Supporting Information
humu_22034_sm_SuppInfo4.txt | 1K | Supporting Information
