• genomics;
  • next generation sequencing;
  • exome;
  • molecular diagnosis


  1. Top of page
  2. Abstract
  3. Introduction
  4. Methods
  5. Discussion
  6. Acknowledgements
  7. References
  8. Supporting Information

The Undiagnosed Diseases Program at the National Institutes of Health uses high-throughput sequencing (HTS) to diagnose rare and novel diseases. HTS techniques generate large numbers of DNA sequence variants, which must be analyzed and filtered to find candidates for disease causation. Despite the publication of an increasing number of successful exome-based projects, there has been little formal discussion of the analytic steps applied to HTS variant lists. We present the results of our experience with over 30 families for whom HTS was used in an attempt to find clinical diagnoses. For each family, exome sequence was augmented with high-density SNP-array data. We present a discussion of the theory and practical application of each analytic step and provide example data to illustrate our approach. The article is designed to provide an analytic roadmap for variant analysis, thereby enabling a wide range of researchers and clinical genetics practitioners to perform direct analysis of HTS data for their patients and projects. Hum Mutat 33:599–608, 2012. © 2012 Wiley Periodicals, Inc.



Introduction

The NIH Undiagnosed Diseases Program (UDP) is designed to evaluate medical syndromes that have been refractory to diagnosis despite extensive assessment [Gahl et al., 2011; Gahl and Tifft, 2011]. Once accepted, participants undergo in-depth medical evaluation at the NIH Clinical Center. Of the individuals or families seen at the NIH, 10%–20% are diagnosed with a known condition based on clinical evaluation. The remaining participants become candidates for research studies designed to detect ultrarare or new diseases that would be difficult, if not impossible, to diagnose using conventional means.

High-throughput sequencing (HTS) has emerged as a powerful tool to study undiagnosed diseases. Many recent publications describe new genes discovered by whole exome sequencing [Bilguvar et al., 2010; Bonnefond et al., 2010; Choi et al., 2009; Erlich et al., 2011; Hoischen et al., 2010; Kalay et al., 2011; Klein et al., 2011; Krawitz et al., 2010; Lalonde et al., 2010; Ng et al., 2010a; Ng et al., 2010b; Puente et al., 2011; Simpson et al., 2011; Sobreira et al., 2010; Walsh et al., 2010; Wei et al., 2011; Worthey et al., 2011], and additional publications report genes identified by related techniques [Brkanac et al., 2009; Johnston et al., 2010; Kahrizi et al., 2011; Lupski et al., 2010; Nikopoulos et al., 2010; Rehman et al., 2010; Rios et al., 2010; Summerer et al., 2010; Volpi et al., 2010].

HTS methods produce a list of genotype calls numbering on the order of 10^4 per exome, 10^5 for the combined exomes of a small family, and 10^6 per genome. The genotype list contains common polymorphisms, rare variants, and false positives. In the early stages of analysis, variants are prioritized and filtered to produce a subset of potentially disease-causing candidate variants. Filtering is based on factors such as population frequency, segregation according to a proposed genetic model, and predicted consequences for gene function. In addition, many of the published HTS diagnostic successes to date have made use of clues that were present before sequencing commenced. Examples include linkage data [Rehman et al., 2010], regions of homozygosity [Walsh et al., 2010], the presence of non-physiologic metabolites [Rios et al., 2010], and clinical similarity to known syndromes.

Application of HTS techniques to the UDP participant cohort is challenging due to the paucity of presequencing clues. Many families have apparently unique syndromes and no history of consanguinity. The available family members often comprise a pedigree that is too small for traditional linkage methods. The nature of the cases has driven the development of methods to maximize the information obtained from small families and/or individuals. Using both previously described and novel techniques, we have found disease-causing mutations in 5 of 30 families to which HTS methods have been applied. A number of additional families have generated highly suggestive candidates that are undergoing functional validation.

In this article we describe the step-by-step process used to analyze DNA sequence variants produced by HTS for our participants in the Undiagnosed Diseases Program (UDP). We provide a composite/artificial set of exome data to assist with the implementation of our techniques at other sites, where similar clinical work is being performed. For each step, we provide a discussion of the rationale behind our approach, a description of how to carry out the analysis with the example data set, and a brief discussion of the tools available for similar analyses. It is our intention to describe an approach that small- and medium-sized centers can use with their own patients, using next generation sequencing (NGS) data obtained by collaboration or from commercial sources.


Methods

Supp. Table S1 provides a beginning-to-end outline of the major steps involved in exome sequencing. Most of the discussion in this article focuses on the “Variant Filtering and Analysis” step in that table. The table can be used to provide some context for the following discussion.

Starting Dataset Acquisition, Annotation, and Characteristics


The starting point for our analysis is a list of annotated DNA sequence variants—the candidate list. As the analysis proceeds, groups of variants will be tentatively removed from the candidate list until there are few enough variants that each may be scrutinized on an individual basis.

The starting candidate list is the product of the following generalized steps: data acquisition (generating sequence short reads from DNA); alignment (matching the short reads to a preexisting reference genome) [Lin et al., 2011; Miller et al., 2010; Schatz et al., 2010]; base calling (determination of the best guess for the genotype, or other sequence feature, at each aligned position) [Ledergerber and Dessimoz, 2011]; and annotation. These steps have been reviewed elsewhere [McKenna et al., 2010]. The term annotation, as used here, requires special mention. Annotation involves multiple procedures used to gather and record information about each detected sequence variant. Examples include, but are not limited to, the alignment of the variant to a specific base position in a known gene; the assessment of the variant's potential to disrupt gene function (“pathogenicity”); and the presence of the variant in databases such as dbSNP. Many annotations can be accomplished with free, publicly available tools such as the Genome Analysis Toolkit (GATK) [McKenna et al., 2010], SeattleSeq, and the Galaxy web site [Giardine et al., 2005; Goecks et al., 2010; Taylor et al., 2007]. A few types of annotation are generated using custom programs developed at individual sequencing centers. For smaller sites lacking the bioinformatics resources of the large centers, the performance of some annotation procedures may be negotiated with a collaborating academic sequencing center or commercial vendor. In any case, a commitment to ongoing communication between the sequencing center and the researcher should be a prerequisite for any collaboration.

For the purposes of this article, a specified set of annotations will be assumed to have been performed before candidate list analysis begins. A few annotations are performed by software that is not yet freely available in a stand-alone form. While those annotations are not absolutely necessary, their omission will result in a longer final candidate list. As an alternative, we have developed a suite of Linux-based software scripts called VAR-MD, which is reported in a separate publication [Sincan et al., 2012]. VAR-MD will provide the variant annotations used in this article starting with a basic set of genotype calls. It will also automate many of the analytic procedures described below. Overlapping functionality is also available in the VAAST program, a recently released tool that can automate some annotation and candidate list manipulation tasks [Yandell et al., 2011].

Our candidate lists are provided by our collaborators at the NIH Intramural Sequencing Center (NISC) in the form of tab-delimited text files with one variant per line. The included annotations and potential data sources are outlined in Table 1. The NISC methodology used to generate the exome data in this article is outlined in Supp. Methods S1. A wide variety of computer programs can be used to view and manipulate a candidate list. We use a Java program called VarSifter, developed by Jamie Teer at NISC [Teer et al., 2012]. Our candidate list, with accompanying annotations, is in a text-file format readable by VarSifter. The VarSifter file format, including information that is common to all similar files, is detailed in Supp. Methods S2. Alternatively, many candidate list manipulations can be carried out using the Galaxy web site [Blankenberg et al., 2010; Goecks et al., 2010], GATK, and/or a spreadsheet such as Microsoft Excel (Microsoft Corporation, Redmond, WA). Commercial solutions are available, and some offer alignment and/or annotation functionality as well, for example, Nextgene (State College, PA) and the tools provided with sequence data generated by Knome (Cambridge, MA).

Table 1. Example Candidate List Annotations

| Item name | Annotation sources (a) | Implementation notes |
|---|---|---|
| Identifier (unique for each variant in candidate list) | Sequencing/assembling/genotyping facility (b), GATK | |
| Chromosome number | Sequencing/assembling/genotyping facility (b), GATK | |
| Variant position within chromosome | Sequencing/assembling/genotyping facility (b), GATK | Positions are given in the context of a specific reference genome, for example NCBI hg18/build36 |
| Reference allele | Sequencing/assembling/genotyping facility (b), GATK | |
| Variant allele | Sequencing/assembling/genotyping facility (b), GATK | |
| Variant type (exon, intron, etc.) | Annovar, SeattleSeq, GATK, VAAST | |
| Gene name | Annovar, SeattleSeq, GATK, VAAST | |
| Transcript | Annovar, SeattleSeq, GATK, VAAST | |
| Strand | Annovar, SeattleSeq, GATK, VAAST | |
| Reference amino acid | Annovar, SeattleSeq, GATK, VAAST | |
| Variant amino acid | Annovar, SeattleSeq, GATK, VAAST | |
| Amino acid position | Annovar, SeattleSeq, GATK, VAAST | |
| Pathogenicity score | Galaxy, GATK, PolyPhen, many others | NISC provides “CDPred” score |
| Coverage | Samtools, BED tools, GATK | |
| Quality measure | Samtools, GATK | NISC provides MPG and MPG/coverage scores. Quality scores should be calibrated to a specific sequencing center/source |
| Mendelian consistency for various genetic models | Manual inspection with spreadsheet, VAR-MD, VAAST | NISC provides annotation with in-house software |
| Compound heterozygote pairing for autosomal recessive genetic model | Manual inspection with spreadsheet | NISC provides annotation with in-house software |

(a) These are incomplete lists. A broad and rapidly expanding list of tools is available.

(b) Often a collaborating sequencing facility can provide some or all of the annotations listed here. Most of the annotations can be carried out separately if needed. However, synergistic benefits can accrue if assembling and genotyping are performed by the same team.

Definitions: GATK: The Genome Analysis Toolkit; VAAST: Variant Annotation, Analysis and Search Tool; BED: “Browser Extensible Data” (a common text file format for defining genomic regions); NISC: NIH Intramural Sequencing Center; NCBI: National Center for Biotechnology Information; Samtools; BED tools. The remaining tools are discussed in the text.

Genome sequencing will eventually become standard for many HTS applications. Until that time, however, the addition of genome-wide data from a high-density SNP array has the potential to add critical additional information to an HTS project, particularly in the case of exome analysis. We obtain SNP array data for every HTS project. We use the Illumina platform and the associated analysis program Genome Studio (Illumina, San Diego, CA). Other types of SNP arrays would be equally suitable.

The guiding principle behind our filtering procedure is that an HTS variant-analysis process must be flexible enough to allow adjustment of all analytic parameters. Those performing the analysis must understand the rationale, procedures, and assumptions inherent in each step.


The files used in the following analyses are available in one of two places. An example data set and interval postprocessing results are available online. The example dataset compexome_30_unfiltered.vs is an exome candidate list created and modified from several projects to protect individual patient data. Each included project involves a family with a similar structure: four individuals including two parents and two full sibs. One sibling is affected with a disorder that appears to be early-onset, severe, and likely to be highly penetrant at an early age. There is no history of consanguinity. High-density SNP arrays have been run for each family member. Individual variations are all biologically derived and there is one verified positive finding in the dataset. The positive finding in the example dataset was found in a family for which the affected child had a childhood-onset neurodegenerative disorder. A number of consistent known diseases, including some lysosomal storage diseases, had been ruled out by specific clinical testing. The story of the original exome-based diagnosis for that family is reviewed in a separate publication [Pierson et al., 2011].

Genotyping Quality Measurement


HTS technology and methods are evolving rapidly. In addition to falling prices, aspects of the laboratory techniques used for data generation change every few months. Interpretation of an HTS candidate list requires an understanding of the genotyping-quality issues associated with the specific techniques used to acquire the data. Excellent reviews of HTS quality assessment are available [Teer et al., 2010]. Quality for a given project should be assessed by, or with, the group who performed the data acquisition. Only that group can provide historical data about their experience with the specific techniques they use. Key issues include variant-call quality near the ends of sequence reads and assemblies, quality of insertion/deletion variant calling, and assessment of presequencing laboratory work.

The average depth of HTS short reads in a sequence alignment is a frequently reported metric of variant-call quality. Coverage for an entire HTS project can be reported in different ways such as “average coverage per base” or “percent of bases covered to depth n.” An example of one potential pitfall of using coverage as the sole measure of variant-call quality is the compression misalignment. In a compression, reads from two highly similar regions, for example, a gene and matching pseudogene, are aligned to the same position on the reference sequence. The two slightly different sequences create apparent non-reference genotype calls where they differ, and simultaneously create an area of falsely reassuring deep coverage.
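The two coverage summaries named above can be computed from a list of per-base read depths. The following is a minimal sketch; the function name and the example depths are hypothetical and are not part of the NISC pipeline.

```python
def coverage_summary(depths, min_depth):
    """Return (average coverage per base, fraction of bases covered to at least min_depth)."""
    if not depths:
        raise ValueError("depths must not be empty")
    mean_depth = sum(depths) / len(depths)
    fraction_at_depth = sum(1 for d in depths if d >= min_depth) / len(depths)
    return mean_depth, fraction_at_depth


# Hypothetical per-base read depths for a ten-base target region.
example_depths = [0, 5, 12, 30, 30, 45, 60, 60, 8, 50]
mean_depth, fraction_20x = coverage_summary(example_depths, min_depth=20)
```

Neither summary would reveal a compression: a collapsed gene/pseudogene pair raises both numbers while lowering genotype accuracy, which is why coverage is paired with a separate quality score.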


Quality assessment metrics for our data were developed by NISC and include a Bayesian statistic for each base call (the Most Probable Genotype or MPG score) and a ratio of the MPG score to the coverage for any given variant [Teer et al., 2010]. The latter makes intuitive sense: the quality score should increase in proportion to the coverage, so a deeply covered variant whose quality score is not correspondingly high may indicate a false-positive genotype call. For the example dataset, variants have been included if at least one family member exceeds a lower cutoff for quality. The lower cutoffs for the MPG and MPG/coverage were empirically derived and set at MPG = 10 and MPG/coverage = 0.5.

Candidate List Filtering: Variant Type


Each analyst must define a starting point with regard to assumptions about the nature of the DNA change(s) affecting their gene of interest. Our usual starting assumptions have failed in some cases, and proven successful in others. Failure to find a convincing candidate simply prompts an additional pass through the data with different assumptions.


As a first pass, we will guess that the disease-causing variation, or variations, involves coding sequence or a canonical splice site. We will further postulate that it will be a typical pathogenic variant, for example, a missense change versus a less common type such as a synonymous splice modifier. After loading compexome_30_unfiltered.vs into VarSifter, the number of variant positions displayed is 116,837—a typically large number for a family of four. The following variant types are selected: insertions/deletions, missense mutations, nonsense mutations, and canonical splice-site mutations. Selecting those variants and applying the filter reduces the number of variants to 14,338 (compexome_31_pathogenic_variants.vs). The mechanism by which filtering occurs is straightforward. VarSifter uses one column of the candidate list file (“type”) to look up the annotated mutation type. Any mutation types not included in the filter are removed from the current view. To relax the criteria, intronic and other mutation categories may be added, followed by refiltering of the original data.
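The column-lookup mechanism described above can be sketched outside VarSifter for a tab-delimited candidate list. The type labels, gene names, and column layout below are hypothetical; a real candidate list uses whatever vocabulary its annotation tool emits.

```python
import csv
import io

# Hypothetical labels for the first-pass variant types.
FIRST_PASS_TYPES = {"indel", "missense", "nonsense", "splice-site"}


def filter_by_type(rows, keep_types, column="type"):
    """Keep only rows whose annotated type is in keep_types."""
    return [row for row in rows if row[column] in keep_types]


# Hypothetical four-variant candidate list, one variant per line.
EXAMPLE = (
    "id\tgene\ttype\n"
    "1\tGENE_A\tmissense\n"
    "2\tGENE_B\tintron\n"
    "3\tGENE_C\tnonsense\n"
    "4\tGENE_D\tsynonymous\n"
)

rows = list(csv.DictReader(io.StringIO(EXAMPLE), delimiter="\t"))
first_pass = filter_by_type(rows, FIRST_PASS_TYPES)
# Relaxing the criteria means refiltering the original rows with a larger set.
relaxed = filter_by_type(rows, FIRST_PASS_TYPES | {"intron"})
```

Because the filter always starts from the original rows, relaxing or tightening the criteria never loses variants permanently.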

Candidate List Filtering: Population Frequency


Filtering by population frequency is an attempt to remove common polymorphisms that are unlikely to be disease causing. It is conversely equivalent to the practice of reporting negative results in a panel of normal controls when describing a new mutation. The disease-causing variant is implicitly assumed to be rare, highly penetrant, and responsible for a large phenotypic effect.

dbSNP [Sherry et al., 2001] is a widely used public database of DNA sequence variations. Entries have a regular format, but are not curated and have nonrequired fields. Most of the HTS analysis articles to date have used dbSNP entries as a filter to remove common variations. Unintentional generation of false-positive or false-negative filtering results can occur with inappropriate application of the dbSNP database. Many dbSNP entries lack population frequency information and/or derive from studies with few individuals. The dbSNP database is known to contain pathogenic mutations; it was never designed to exclude them. In a 2008 study, Won et al. demonstrated that 8% of the sequence variations in dbSNP (v.126) were also present in the Human Gene Mutation Database (HGMD) [Stenson et al., 2008; Won et al., 2008]. The HGMD (BIOBASE Biological Databases, Wolfenbüttel, Germany) is ostensibly a list of human disease-causing variations, although it is known to be only as good as the medical literature it collates. The HGMD/dbSNP overlap serves to illustrate the potential for misclassification of DNA sequence variants by using an unselected database.

The 1000 Genomes Project is increasingly providing an invaluable resource for identifying common DNA sequence variations. It is available as a subset of current versions of dbSNP or by itself from the 1000 Genomes web site [Sudmant et al., 2010]. The 1000 Genomes variants are annotated with heterozygosity information allowing for the construction of filters with a specified lower limit of population heterozygosity. Determining an appropriate heterozygosity exclusion criterion requires an estimate of disease incidence. For ultrarare conditions in Hardy–Weinberg equilibrium and with incidences of the order of 1:1,000,000, the expected heterozygosity in the population is 1/500 or 0.002 (0.2%). For a condition with an incidence of 1/10,000, it is 2%. It is preferable to set the criterion too high rather than too low as the latter will run the risk of excluding the disease-causing variation being searched for.
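The heterozygosity figures quoted above follow from the Hardy–Weinberg relationship: if a recessive disease has incidence q², the disease-allele frequency is q and the expected heterozygosity is 2pq with p = 1 − q. The sketch below reproduces that arithmetic. It assumes, for illustration only, that a single allele at one locus accounts for all cases; the function name is hypothetical.

```python
import math


def expected_heterozygosity(incidence):
    """Expected heterozygote frequency 2pq for a recessive disease of the given incidence.

    Assumes Hardy-Weinberg equilibrium and a single disease allele (illustrative only).
    """
    q = math.sqrt(incidence)  # disease-allele frequency
    p = 1.0 - q               # frequency of all other alleles
    return 2.0 * p * q


ultrarare = expected_heterozygosity(1 / 1_000_000)  # about 0.002, roughly 1 in 500
rare = expected_heterozygosity(1 / 10_000)          # about 0.02, roughly 2%
```

If several different alleles contribute to the disease, each is individually rarer than this estimate, so a cutoff chosen this way errs toward the preferred direction of being too high.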


VarSifter allows filtering using BED-formatted text files, the BED format providing a means to define arbitrary genomic intervals. Recent developments at the Galaxy web site allow for the rapid construction of BED files with dbSNP data. Filters should be reconstructed with each dbSNP release as new data are added regularly. Table 2 shows the results of population frequency filtering with files for several different heterozygosity cutoffs including 0.5%, 1%, 2%, and 5%. The 1% heterozygosity filter in Table 2 was applied using both dbSNP131 and dbSNP132 to highlight the fact that using updated filters is important to maximize the number of excluded variants. The dbSNP132 filter excluded approximately 3,000 more variants than the prior version. Each BED file is available online. For our filters, we use a subset of dbSNP that includes 1000 Genomes data and HapMap variants/polymorphisms that align uniquely to the genome. A method for constructing such filters using the Galaxy web site is provided as Supp. Methods S3. Using the dbSNP132 1% filter, the example candidate list is reduced in size from 14,338 to 5,041 (compexome_32_DB132.vs). The population filtering threshold can be adjusted by creating files for various SNP heterozygosity cutoffs and substituting files as desired to adjust filtration.
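As an illustration of BED-based filtering outside VarSifter, the sketch below loads BED intervals and either removes variants that fall inside them (as for a polymorphism filter) or keeps only those that do (as for a set of regions to retain). It assumes variant positions are 1-based, whereas BED intervals are 0-based and half-open; the function names and example coordinates are hypothetical.

```python
from collections import defaultdict


def load_bed(lines):
    """Parse BED lines into {chrom: sorted list of (start, end)} with 0-based, half-open intervals."""
    intervals = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue  # skip comments and browser/track header lines
        chrom, start, end = line.split("\t")[:3]
        intervals[chrom].append((int(start), int(end)))
    for chrom in intervals:
        intervals[chrom].sort()
    return intervals


def in_bed(intervals, chrom, pos_1based):
    """True if a 1-based position lies inside any interval on that chromosome."""
    pos0 = pos_1based - 1  # convert to the 0-based BED coordinate
    for start, end in intervals.get(chrom, []):
        if start > pos0:
            break  # intervals are sorted by start, so no later one can contain pos0
        if pos0 < end:
            return True
    return False


def filter_variants(variants, intervals, keep_inside):
    """Keep variants inside the intervals (keep_inside=True) or outside them (False)."""
    return [v for v in variants if in_bed(intervals, v[0], v[1]) == keep_inside]


# Hypothetical BED file marking two single-base polymorphic sites.
bed = load_bed(["chr1\t99\t100", "chr2\t499\t500"])
variants = [("chr1", 100), ("chr1", 101), ("chr2", 500)]
after_exclusion = filter_variants(variants, bed, keep_inside=False)
inside_only = filter_variants(variants, bed, keep_inside=True)
```

The linear scan is adequate for a sketch; a filter built from a full dbSNP release would normally use an indexed interval search.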

Table 2. Population Frequency Filtering of Candidate List from Exome Sequencing

| Filename | dbSNP version | Heterozygosity cutoff | Starting variants (full candidate variant list) | Postfiltering variants |

a. Only includes HapMap and 1000 Genomes Project data, and only uniquely mapping sites.

Note that filtration numbers are all in the same order of magnitude, suggesting that the majority of the excluded SNPs are relatively common and appear in all of the filters. An additional ∼3,000 SNPs were filtered out by updating the filter from dbSNP131 to dbSNP132, highlighting the fact that the databases are significantly updated between releases.


Candidate List Filtering: Gene and Site Exclusion Lists


Some sequence variants can often be excluded a priori during a first-pass analysis. Two types of exclusion are explored in Fuentes Fajardo et al. [Fuentes Fajardo et al., 2012]. Excluded genes contain multiple variants in every HTS-sequenced individual and are identified by retrospective analysis of accumulated exome data. These genes may fall into one of several categories: pseudogenes, groups of paralogs such as olfactory receptors, and/or chromosomal regions with biologically important hypervariability. An example of the last is the HLA region on chromosome 6. In addition, individual base pairs can be excluded. Base-pair exclusions are made based on the meta-analysis of a collection of exomes, preferably from one sequencing center and set of related sequencing methods. Examples include sites that are always heterozygous (likely to be caused by alignment problems specific to a given alignment methodology) and sites that are always homozygous nonreference (sites where the reference sequence contains a minor allele).

Occasionally, certain projects will require the reinclusion of typically excluded genes or sites. The analyst should be familiar with the contents of any exclusion lists employed, so that modifications can be made as needed.


We use two exclusion lists developed using the techniques referred to in Fuentes Fajardo et al. [Fuentes Fajardo et al., 2012]. The gene list is a text file with gene names, and the individual-base-pair list is a complemented BED file similar to the one used in the earlier population frequency filter. Application of the base-pair exclusions reduces the candidate list from 5,041 to 3,752 (compexome_33_HWE_BEDfile_2.vs), and application of the gene list reduces the number further to 2,360 (compexome_34_Gene_Kill_List.vs). The BED file may be specific to our data acquisition methods, but the gene list should be usable by other centers. Both files are provided as gene exclusion_list.txt and base-pair_exclusion_list.txt, respectively.

Candidate List Filtering: Genotyping Quality Criteria


Low-confidence genotype calls may be removed during the data acquisition and annotation process. Only highly compelling criteria should prompt such variant removal. In the remaining cases, a quality score can be used to provide guidance to the candidate list analyst. Take as an example a case where three out of four family members have good quality data suggesting an important candidate variant. The variant may deserve consideration despite the fact that one family member has poor-quality data. Such variants are examples of what to revisit if an answer is not found during a first pass analysis. Genotyping quality scores, therefore, represent an additional variable that can be used to adjust filtration stringency.


As mentioned previously, our collaborators at NISC use the MPG score and MPG score/coverage ratio to annotate variant quality. The VarSifter program allows specification of the number of family members in a pedigree who need to exceed a given cutoff for inclusion in the postfiltration list. For our example, we specify that all four family members need to have an MPG score of at least 10 and an MPG/coverage score of 0.5 or greater. The subsequent filtration reduces the number of variants from 2,360 to 1,469 (compexome_35_Quality_filters.vs).
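The pedigree-aware quality filter can be sketched as a count of family members whose genotype call clears both cutoffs. The cutoffs (MPG of at least 10, MPG/coverage of at least 0.5) come from the text; the function names and the example scores are hypothetical, and the MPG scores themselves are assumed to be supplied by the sequencing center.

```python
def call_passes(mpg, coverage, min_mpg=10.0, min_ratio=0.5):
    """True if one genotype call meets both the MPG and MPG/coverage cutoffs."""
    if coverage <= 0:
        return False  # no reads, so no usable call
    return mpg >= min_mpg and (mpg / coverage) >= min_ratio


def variant_passes(per_member, min_members):
    """True if at least min_members family members have a passing call.

    per_member maps a family member to an (MPG score, coverage) pair.
    """
    passing = sum(1 for mpg, cov in per_member.values() if call_passes(mpg, cov))
    return passing >= min_members


# Hypothetical scores: the father's call is deeply covered but has a low ratio (30/100 = 0.3).
family = {
    "proband": (40, 30),
    "sib": (25, 20),
    "mother": (12, 14),
    "father": (30, 100),
}
lenient = variant_passes(family, min_members=1)  # inclusion rule for the starting dataset
strict = variant_passes(family, min_members=4)   # rule applied at this filtering step
```

Lowering `min_members` is one way to revisit variants like the three-out-of-four example above when a first pass finds nothing.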

Candidate List Filtering: Family Structure


In the recent past, HTS data acquisition costs were frequently a limiting factor in experimental design. As costs drop, data acquisition feasibility is giving way to other design issues. One consideration is whether or not to sequence additional family members beyond the proband. Added family members have the potential to directly and substantially decrease the number of candidate variations in an HTS project. Figure 1 illustrates the effect of the incorporation of family data on final candidate list size.


Figure 1. Cumulative Filtration of Exome Variant Lists from 22 Families. A set of 22 exome projects is displayed using two different analytical approaches: one uses all available family data (black) and the other uses only data from the proband (red). The y-axis is the log10 of the number of cumulatively filtered, residual variants. The x-axis shows filtration steps, which are sequential from left to right. The last two steps (homozygotes and heterozygotes) both use the set of variants produced from the preceding quality filtration step and are not sequential. Note that the implementation of homozygote and heterozygote filters differs between single exome analyses and family-based analyses. Mendelian segregation and phase information is not available in the case of single exome analysis. Homozygotes are not checked for inheritance from both parents. The “heterozygote” count is a tabulation of all pairwise combinations of variants for those cases where more than one heterozygous variant is found in the same gene. Single exome projects start with fewer variants and end with a larger number of candidates for further study. See the text for a further explanation of the various filtration steps.


Added family members can be analyzed with concurrent SNP array analysis to provide recombination mapping (precise segregation-consistent chromosomal intervals) [Roach et al., 2010], mosaicism detection [Gonzalez et al., 2011; Markello et al., 2011a], identification of regions of homozygosity, estimates of inbreeding coefficients, confirmation of parentage, uniparental disomy analysis, and detection/interpretation of copy number variations. As an example, if a proband and a father share a single-copy deletion, then the sequence of the corresponding maternal allele in the proband should be interrogated for possible complementary loss-of-function variations that might generate a phenotype when paired with the paternally inherited deletion. If the same deletion is new to the proband, then a different set of mechanisms can be considered, including haploinsufficiency or a complementary variant inherited from whichever parent contributed the non-deleted allele.

While recombination mapping can be performed using genome sequencing data, exome projects require the addition of genome-spanning high-density SNP array data. Construction of recombination maps using SNP data is described in an accompanying article by Markello et al. [Markello et al., 2011b]. Recombination mapping is analogous to traditional linkage analysis, which produces variable likelihood-based estimates of linkage between widely spaced markers. The close proximity of SNPs on a high-density SNP array means that the probability of a double-crossover event between a given pair of markers is small. Consequently, sites of recombination can be mapped in a “square wave” fashion, with regions of consistent and non-consistent segregation mapped to a precision on the order of a few kilobases. For exome candidate list analysis, regions that have segregated in a manner consistent with a given genetic model can be defined with a BED file. Variants outside the consistent regions are filtered out.

Consistent segregation can also be verified for individual variants [Choi et al., 2009; Ng et al., 2010b]. The group of variants filtered by recombination mapping overlaps but is not identical to the set of variants excluded by individual-variant segregation filtering. The difference probably represents variants in segregation-valid regions that are sequencing false positives. The stringency of segregation filtering is determined by the number of “errors” tolerated by the filter. For instance, consider the following situation. Given a postulated autosomal recessive model, a pattern of variation for a family of four includes a consistent proband, one consistent sib, one consistent parent, and a second parent with missing data (e.g., a local sequencing failure). Should this variant be included or excluded? The rules used to answer that question will define the stringency of the filter.
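The stringency question posed above can be made concrete with a small rule set. The sketch below tests one variant against a homozygous autosomal recessive model and tolerates a configurable number of missing genotypes. Genotypes are encoded as the count of variant alleles (0, 1, or 2) with `None` for missing data. The encoding, the role names, and the function are hypothetical, and the rule that each parent must be heterozygous deliberately ignores de novo events, deletions, and uniparental disomy.

```python
# Genotypes allowed for each role under a homozygous recessive model (simplifying assumption).
ALLOWED = {
    "affected": {2},           # homozygous for the variant allele
    "parent": {1},             # obligate carrier
    "unaffected_sib": {0, 1},  # anything except homozygous variant
}


def consistent_homozygous_recessive(genotypes, roles, max_missing=0):
    """True if every called genotype fits the model and at most max_missing are missing."""
    missing = 0
    for member, role in roles.items():
        genotype = genotypes.get(member)
        if genotype is None:
            missing += 1
        elif genotype not in ALLOWED[role]:
            return False  # a called genotype contradicts the model
    return missing <= max_missing


roles = {
    "proband": "affected",
    "sib": "unaffected_sib",
    "mother": "parent",
    "father": "parent",
}
# The situation described in the text: three consistent members, one parent with missing data.
one_missing = {"proband": 2, "sib": 1, "mother": 1, "father": None}
strict_result = consistent_homozygous_recessive(one_missing, roles, max_missing=0)
tolerant_result = consistent_homozygous_recessive(one_missing, roles, max_missing=1)
```

Under the strict rule the variant is excluded; allowing one missing genotype retains it, so the value of `max_missing` is effectively the stringency setting of the filter.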


Genome Studio was used for the high-density SNP-array analyses including the straightforward visualization of copy number variants and the more complex detection of recombination sites using Boolean rule-sets. The methods for the latter types of analyses are provided in the articles referenced above.

We decided to obtain exome sequence on multiple family members. The decision was based on several factors: (1) There was no evidence of consanguinity or potential for homozygosity mapping based on previously obtained SNP array data; (2) there were no clinical findings to suggest a specific set of genes implicated in disease causation (that had not been excluded by clinical testing); and (3) there was no linkage region or other mapping data to establish a genomic candidate region. We therefore chose the most powerful approach for agnostic screening of the exome and sequenced both parents and one unaffected sibling along with the proband.

Recombination mapping was carried out using the methods described in Markello et al. [Markello et al., 2011b]. The procedure involves using Genome Studio to apply a set of Boolean segregation rules to SNP array data. The resulting recombination map was defined in a BED-formatted file (Linkage File.txt). The BED file was applied using VarSifter and reduced the candidate number from 1,469 to 958 (compexome_36_linkage_regions.vs).

Our candidate list includes specific annotations regarding Mendelian consistency. Custom scripts use family-relationship data to test whether a given variant segregated in a biologically feasible manner and flag it as inconsistent if it did not. Furthermore, regions defined by gene boundaries are surveyed for pairs of variants that could make up a compound heterozygote set. Such variants are annotated with a column that lists the index number(s) of the complementing variant or variants (Nancy Hansen, unpublished data). The Mendelian consistency annotations may not be available in candidate lists from all sequencing centers. The VAR-MD [Sincan et al., 2012] and VAAST [Yandell et al., 2011] programs can incorporate such information. However, once the variant list gets short enough, a spreadsheet can be used to sort the variants by gene name. Once sorted, the contents of individual loci can be inspected for Mendelian relationships.

For our candidate list, we postulated an autosomal recessive genetic model because both parents were unaffected. A new dominant model would also be appropriate for a potential subsequent analysis. The recessive inheritance could arise from homozygous or compound heterozygous mutations. Application of the appropriate filters with VarSifter results in 7 homozygote candidates (compexome_37a_homozygous_recessive.vs) and 94 compound heterozygote candidates (compexome_37b_compound_heterozygotes.vs).
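The genetic models discussed above can be expressed as simple genotype rules for a family of two unaffected parents, an affected proband, and one unaffected sibling. The sketch below is an assumption-laden illustration rather than the VarSifter filter definitions: genotypes are encoded as non-reference allele counts, and the rule set and labels are hypothetical.

```python
def classify(gt):
    """Assign a variant to a candidate inheritance model, or None if it fits none.

    `gt` maps family member to non-reference allele count (0, 1, or 2).
    """
    p, m, f, s = gt["proband"], gt["mother"], gt["father"], gt["sibling"]
    # Homozygous recessive: both parents carriers, unaffected sibling not homozygous.
    if p == 2 and m == 1 and f == 1 and s != 2:
        return "homozygous_recessive"
    # Possible half of a compound heterozygote: inherited from exactly one carrier parent.
    if p == 1 and sorted((m, f)) == [0, 1]:
        return "compound_het_candidate"
    # New dominant: present only in the proband.
    if p == 1 and m == 0 and f == 0 and s == 0:
        return "new_dominant"
    return None


examples = [
    {"proband": 2, "mother": 1, "father": 1, "sibling": 1},
    {"proband": 1, "mother": 1, "father": 0, "sibling": 0},
    {"proband": 1, "mother": 0, "father": 0, "sibling": 0},
    {"proband": 2, "mother": 1, "father": 1, "sibling": 2},
]
labels = [classify(gt) for gt in examples]
print(labels)
```

The compound heterozygote label applies to single variants only; whether a complementing variant exists on the other parental allele has to be established in a separate step.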

Working With the Candidate List: Assessment of Individual Variants


Inspection of the example files shows that the candidate list is now small enough for each variant to be considered individually for goodness of fit with the clinical syndrome. Additional tools become useful at this stage. Individual variant positions should be looked up in any available databases of known genomic variants. Homozygous variant positions should be compared with the positions of any regions of homozygosity identified by SNP array analysis or other means. Apparently homozygous variants should also be correlated with any single-copy deletions, to see if the two might combine to cause an autosomal recessive disease.


In our example, each homozygous variant is associated with a dbSNP “rs” number, providing an additional source of information. Individual variants may require in-depth research. For example, among the homozygotes is a p.A34E mutation in the PPT2 gene, dbSNP number rs3096696. The coverage is low for the mother and the proband at 14 reads (compare with other variations in the list with coverage in the 50 to 200 range). Inspection of the dbSNP record reveals that the variation has been seen in homozygous form in 19% of 39 cell lines derived from persons of Caucasoid, African-American or American Indian ethnicities, 28 out of 39 of whom had known consanguinity. The SNPs were reported by a researcher at the Fred Hutchinson Cancer Research Center and contact information is available. Inspection of the Online Mendelian Inheritance in Man web site (OMIM) shows that PPT2 (MIM# 603298) has a known mouse model with a neurological phenotype. In addition to the dbSNP record, the laboratory that performed the HTS data acquisition should be able to inspect the raw alignment data to see if the variant is in an area consistent with genotyping errors. For the example case, similar research was able to deprioritize all of the homozygous variations.

The compound heterozygote list has 94 individual variants. Compound heterozygotes must have at least two pathogenic, trans-oriented mutations to satisfy an autosomal recessive model. A study of the specifics of the annotation of our candidate list provides an example of how knowledge about each step of data production is critical for interpretation.

First, for our list, family-based Mendelian-consistency annotation is carried out before the final quality-based variant exclusions are decided. As a result, some variants are removed from the dataset after compound heterozygosity variant pairings are established.

Second, an individual variant can be inherited in a manner consistent with a compound heterozygote model, but never have had a second mutation to complement it.

Third, and as a corollary to the second item, multiple variants at one locus may not be pairable if they all occur on the same allele.

Fourth, a pair of trans-oriented variants at a given site may have one good candidate and one poor candidate (poor quality, low pathogenicity prediction and/or known benign changes based on literature or other information).

As a result of these four factors, the list of compound heterozygotes includes numerous variants that are annotated as consistent with compound heterozygous inheritance, but can be excluded due to lack of a second, high quality, trans variant. As mentioned in a previous section, part of the NISC annotation pipeline attempts to find variant pairs that together would explain compound heterozygous inheritance. VarSifter will display consistent pairings, and the example data set reveals only 3 pairs of variants (out of the original 94 individual variants).
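The pairing logic implied by the four factors can be sketched as follows. This is not the NISC annotation pipeline; it assumes each heterozygous proband variant has already been assigned a parent of origin and a quality flag, and that carrier status in the unaffected sibling is known. All field names and identifiers are hypothetical.

```python
from collections import defaultdict
from itertools import product


def trans_pairs(variants):
    """Return (gene, maternal_id, paternal_id) tuples for usable trans pairs."""
    by_gene = defaultdict(lambda: {"maternal": [], "paternal": []})
    for v in variants:
        if v["passes_qc"]:  # drop variants removed by late quality filters
            by_gene[v["gene"]][v["origin"]].append(v)
    pairs = []
    for gene, sides in sorted(by_gene.items()):
        # A pair needs one variant from each parent; same-parent variants are cis.
        for mat, pat in product(sides["maternal"], sides["paternal"]):
            if mat["in_unaffected_sib"] and pat["in_unaffected_sib"]:
                continue  # unaffected sibling carries both, so the pair is unlikely causal
            pairs.append((gene, mat["id"], pat["id"]))
    return pairs


def variant(vid, gene, origin, passes_qc=True, in_sib=False):
    """Build one hypothetical variant record."""
    return {"id": vid, "gene": gene, "origin": origin,
            "passes_qc": passes_qc, "in_unaffected_sib": in_sib}


candidates = [
    variant("v1", "GENE_A", "maternal"),
    variant("v2", "GENE_A", "paternal"),
    variant("v3", "GENE_B", "maternal"),
    variant("v4", "GENE_B", "maternal"),                   # both on the maternal allele
    variant("v5", "GENE_C", "maternal"),
    variant("v6", "GENE_C", "paternal", passes_qc=False),  # partner failed quality filters
    variant("v7", "GENE_D", "paternal"),                   # never had a partner
]
pairs = trans_pairs(candidates)
print(pairs)
```

In this toy list, seven individually consistent variants collapse to a single usable pair, which mirrors the reduction from 94 variants to 3 pairs described for the example data set.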

Working With the Candidate List: Pathogenicity Assessment


Pathogenicity prediction estimates the effect a DNA variation will have on gene function. It is not unique to HTS and is frequently incorporated into the analysis of unknown sequence variants from other sources. Most of the available automated tools focus on the alteration of amino acids in coding regions. However, specialized tools are available to predict the effect of non-coding variants in splicing [Brunak et al., 1991; Desmet et al., 2009; Hebsgaard et al., 1996; Pertea et al., 2001] and regulatory regions [Venter and Warnich, 2009].

The criteria used to assess the pathogenicity of missense mutations include intraspecies conservation, information about protein structure (predicted and experimental), amino acid chemical similarity, coincidence with disease and functional assay. Many related software programs exist including Polyphen [Adzhubei et al., 2010], SIFT [Ng and Henikoff, 2001], Panther [Mi et al., 2005; Thomas et al., 2003], SNAP [Bromberg and Rost, 2007; Bromberg and Rost, 2008; Bromberg et al., 2008] and others. In general, pathogenicity prediction has false positive and false negative rates between 10% and 20% [Ng and Henikoff, 2006]. As a result of these substantial error rates, the predictions are primarily useful for prioritizing variation candidates and must be used with caution in assessing individual variants. When choosing a pathogenicity prediction software program, there are several features to consider beyond ease of use and convention. The optimal program would have the following characteristics: (1) the criteria by which individual predictions are made should be accessible to the user; (2) programs should provide results for a wide variety of regions, but should also reflect the paucity or abundance of information for a given site; and (3) the software should produce a reasonably variable numeric score to allow the prioritization of a long list of variants.


The UDP data sets are analyzed with CDPred, a component of the NISC annotation pipeline [Johnston et al., 2010]. CDPred estimates variant pathogenicity using alignment conservation data from the Conserved Domain Database [Marchler-Bauer et al., 2003]. When conserved domain alignments cannot be made, the program defaults to a BLOSUM matrix based on empirically derived substitution frequencies [Henikoff and Henikoff, 1992]. Increasingly positive and negative integers indicate decreasing and increasing pathogenicity, respectively. Stop mutations and canonical splice site mutations are arbitrarily set at −30, a value more negative than that seen for any missense mutation.

The example data set contains three compound heterozygote pairs, and the CDPred scores for each can be inspected to get a sense of how severe the mutations are. One pair has positive CDPred scores, suggestive of relatively mild effect on gene function. The other two pairs have negative scores, which are more consistent with a disease-causing mutation. All of the mutations are missense, so there are no very low scores such as the −30 seen for a stop mutation. As mentioned, using pathogenicity prediction software for individual variants is risky, and for a list this small it would mainly be used to get a general sense of mutation severity. However, in less favorable cases, there may be a long list of variants, and sorting by pathogenicity is a useful way to focus on a subset of data for initial analysis.
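Sorting a longer list by a CDPred-style score can be sketched as below. The code does not compute CDPred scores; it assumes that a score is already attached to each missense variant, follows the convention described above (more negative means more severe, with stop and canonical splice site changes fixed at −30), and ranks a compound heterozygote pair by its milder member on the assumption that both variants must be damaging. The scores and identifiers are hypothetical.

```python
STOP_OR_SPLICE_SCORE = -30  # fixed value for stop and canonical splice site changes


def effective_score(variant):
    """Score used for ranking; lower (more negative) suggests greater severity."""
    if variant["consequence"] in {"stop", "canonical_splice"}:
        return STOP_OR_SPLICE_SCORE
    return variant["score"]


def rank_variants(variants):
    """Order variants from most to least severe predicted effect."""
    return sorted(variants, key=effective_score)


def pair_score(first, second):
    """Score a compound heterozygote pair by its milder (higher-scoring) member."""
    return max(effective_score(first), effective_score(second))


variants = [
    {"id": "v1", "consequence": "missense", "score": 3},
    {"id": "v2", "consequence": "missense", "score": -7},
    {"id": "v3", "consequence": "stop", "score": None},
    {"id": "v4", "consequence": "missense", "score": -2},
]
ranked_ids = [v["id"] for v in rank_variants(variants)]
print(ranked_ids)
print(pair_score(variants[1], variants[3]))
```

Given the error rates quoted above, such a ranking is best used to decide which variants to examine first, not to exclude variants outright.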

Working With the Candidate List: Previously Reported Mutations


Numerous software tools are available to search for associations between candidate variants and known clinical syndromes. Examples include the Online Mendelian Inheritance in Man (OMIM) web site, PubMed, the Human Gene Mutation Database [Stenson et al., 2008], disease-specific mutation repositories, and software such as Alamut (Interactive Biosoftware, Rouen, France) that can collate information from multiple sources. Many sequence variants listed as pathogenic will not have been adequately characterized, so care must be taken when assigning disease causation.


One of the compound heterozygote genes is GLB1, and one of the variants in the pair, p.R201H, has been reported as being associated with GM1 gangliosidosis. That information supports the hypothesis that GLB1 is the disease-causing gene.

Working With the Candidate List: Incorporation of Preexisting Information


Preexisting knowledge about the biology or genetics of a particular project can be added at any stage of analysis. If there is strong evidence that the causal variation(s) will be present in a specific chromosomal region, a targeted capture technique may be preferable to exome capture. Targeted capture will genotype a wider range of potential non-coding regulatory sites and intronic sequence in the region of interest. In either case, the candidate list can be narrowed by creating a BED file that defines specific regions of interest. Alternatively, a list of genes located in a candidate region can be specified.

Erlich et al. [Erlich et al., 2011] reported an approach by which a candidate gene list was narrowed using disease network analysis. The approach is suited for cases where the syndrome being studied shows genetic heterogeneity. Genes known to cause the syndrome are inspected for commonalities in physical structure, expression patterns, shared domains, etc. Identified shared characteristics are then searched for in genes with variants in the whole exome candidate list. Several software tools available for network analysis are reviewed in the article by Erlich et al.

Individual clinical hypotheses can be studied by making gene lists associated with specific syndromes. For example, a syndrome with characteristics consistent with a mitochondrial disease would be studied using a gene list of all known nuclear-encoded mitochondrial genes.


We have developed and/or adapted a number of disease gene lists for conditions known to exhibit significant genetic heterogeneity. For example, many of the UDP participants have medical syndromes that could be caused by mitochondrial disease. However, there are a large number of nuclear-encoded mitochondrial genes, and testing them by individual Sanger sequencing is time consuming and expensive. Exome sequencing provides an alternate approach to such diseases, and can be used to sequence all of the genes of interest within the limits of exome sequencing coverage characteristics. The exome candidate list is then examined by looking only at genes known to be associated with mitochondrial disease.
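Restricting the candidate list to a disease gene list amounts to a set-membership test on gene symbols, as in the following sketch. The three symbols shown are only an illustrative subset of nuclear-encoded mitochondrial genes, not one of the curated UDP lists, and the function names are hypothetical.

```python
def load_gene_list(lines):
    """Read one gene symbol per line, ignoring blank lines and '#' comments."""
    symbols = set()
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            symbols.add(line.upper())
    return symbols


def filter_by_gene_list(variants, gene_list):
    """Keep variants whose gene symbol appears in the list (case-insensitive)."""
    return [v for v in variants if v["gene"].upper() in gene_list]


gene_list_lines = [
    "# illustrative subset only, not a curated list",
    "POLG",
    "SURF1",
    "ndufs1",
]
variants = [
    {"id": "v1", "gene": "POLG"},
    {"id": "v2", "gene": "GLB1"},
    {"id": "v3", "gene": "NDUFS1"},
]
gene_list = load_gene_list(gene_list_lines)
kept_ids = [v["id"] for v in filter_by_gene_list(variants, gene_list)]
print(kept_ids)
```

Matching on symbols is sensitive to gene name aliases, so in practice the list and the annotation should use the same nomenclature release or stable gene identifiers.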

Unfortunately, in the example case, the clinical phenotype did not match any of those conditions. Inspection of the high-density SNP arrays did not show any genomic lesions that would provide a candidate locus or basis for targeted capture. However, the clinical presentation did have features consistent with a neurological lysosomal storage disease (LSD), including progressive symptoms.

Working With the Candidate List: Variant Validation


Once variants are detected by an HTS technique, it is standard practice to validate candidates of interest using Sanger sequencing. For our sequencing collaborators, approximately 90% of HTS detected variants will validate. For variants that are not well supported by previous work, functional analysis must be the final determinant of the pathogenic role of a DNA sequence variant.


Given a clinical syndrome suggestive of an LSD, and knowing that mutations in GLB1 cause GM1 gangliosidosis (a type of LSD), we performed in-house Sanger sequencing followed by CLIA laboratory sequencing to verify the HTS variants detected in the GLB1 gene. We then repeated prior clinical enzymatic testing that had ruled out GM1 gangliosidosis. It turned out that the prior negative testing had been a false-negative result and that the beta-galactosidase (GLB1 gene product) enzymatic-activity level was indeed reduced to a level consistent with disease [Pierson et al., 2011].



We have presented a framework for the analysis of HTS data, starting with a large list of annotated DNA sequence variants, and ending with a small list of high-value candidates for confirmation or research follow-up. We have further provided an example data set that includes a validated result. In the Introduction, we reported that we have found unequivocal disease-causing mutations in 5 of 30 families, plus several other compelling candidate DNA variants. Our record illustrates that many HTS projects do not generate a clear-cut answer during the early stages of analysis. In some cases, relaxing filtering constraints will expand the list to include an obvious candidate. In other cases, the result is a list of weak candidates that require significant laboratory work to definitively exclude or confirm.

When considering HTS failure, it is important to consider the types of variants not detected by the methodology. The default/first-pass scheme outlined for our example will not detect pathogenic synonymous mutations, non-canonical splice site mutations, or common mutations that cause disease in certain circumstances (e.g., the common MTHFR c.677C>T mutation in the setting of folate deficiency). For our first-pass analysis, such variants are missed because of high filtration stringency. In other cases, the data acquisition methodology fails to genotype the variant of interest. Examples include regions not included in the capture design, and regions that are not well sequenced despite being captured. The latter is often due to local sequence characteristics including repetitive DNA elements and/or low sequence complexity. The analyst must know how the capture technique is designed (what would be captured under optimal conditions), how the capture methods perform in practice (what regions are actually captured by a given lab, chemistry and procedure), and how well individual regions are sequenced once captured.

Failure of HTS projects may be caused by incorrect hypotheses regarding the genetic mode of inheritance. In our example, the pedigree was consistent with several genetic/segregation mechanisms including autosomal recessive and new dominant. We do not discuss more complex models such as multiallelic inheritance. Filtering based on a multiallelic model can be attempted for individual hypotheses by viewing variants only from genes belonging to a specific pathway. Automated analysis is also possible, but requires bioinformatics procedures that are outside the scope of this article.

The provided example dataset included exome sequence for a small family. Other potential datasets include whole genomes, exomes from single individuals and custom capture of a genomic candidate region. Different specialized methods, and/or additional analyses, would be better suited to detect gene–gene interactions and epigenetic phenomena [Feng et al., 2011]. The techniques outlined for our example have variable application to other datasets. Some elements, such as population frequency and kill-list filtering, are applicable to single exomes and subsets of whole genome sequence.

For single exomes, some filters will not be usable due to the lack of family structure data. However, with sufficient other clues, such as candidate regions or gene lists, the remaining filters may be adequate to find the causative variant(s). A special case of single exome sequencing is the simultaneous clinical testing of a large set of genes. An example of a disease for which such a technique might be advantageous is spinocerebellar ataxia (SCA). SCA is a neurologic syndrome comprising multiple overlapping diseases caused by multiple different genes [Matilla-Duenas et al., 2010]. Given the current ∼$1,000 cost for an exome, it is enticing to consider using whole exome sequencing to screen the relevant genes for variants instead of paying the tens of thousands of dollars needed to screen the same genes using commercial clinical sequencing. However, several caveats need to be considered. First, several of the SCAs are caused by repeat expansions, a type of genetic lesion that is not reliably detected by current HTS methods, especially exome sequencing. Second, exome data from any given laboratory must be carefully studied to determine how well the SCA genes are captured, covered, and genotyped. Third, while exome sequencing is a convenient way to survey a group of genes for variants, it is more difficult to determine how well variants have been excluded from the remaining regions. In other words, the pattern of false negative results is less well understood for HTS than for Sanger sequencing. Dias et al. looked at coverage for a group of genes known to be associated with inherited neurologic disease [Dias et al., 2012]. Surprisingly, although most positions were well covered/genotyped in most individuals, there were no genes that were covered adequately in all individuals. Furthermore, the pattern of “missed” sites was distributed among the sequenced individuals rather than being concentrated in a few “bad reads.” These data suggest that we have work to do in understanding the false negative characteristics of exome sequencing.
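A per-gene coverage survey of the kind described can be sketched as follows. This is not the method of Dias et al.; it assumes that per-position read depths are already available for each gene and each sequenced individual, and the depth threshold, adequacy fraction, and sample data are hypothetical.

```python
def fraction_covered(depths, min_depth=10):
    """Fraction of positions with read depth at or above `min_depth`."""
    if not depths:
        return 0.0
    return sum(d >= min_depth for d in depths) / len(depths)


def coverage_table(depth_by_sample, min_depth=10):
    """Build {gene: {sample: fraction of positions adequately covered}}."""
    table = {}
    for sample, genes in depth_by_sample.items():
        for gene, depths in genes.items():
            table.setdefault(gene, {})[sample] = fraction_covered(depths, min_depth)
    return table


def genes_adequate_in_all(table, min_fraction=0.95):
    """Genes whose covered fraction meets `min_fraction` in every sample."""
    return sorted(
        gene for gene, per_sample in table.items()
        if all(frac >= min_fraction for frac in per_sample.values())
    )


# Hypothetical per-position depths for two genes in two individuals.
depth_by_sample = {
    "sample1": {"GENE_A": [30, 40, 50, 60], "GENE_B": [30, 5, 40, 50]},
    "sample2": {"GENE_A": [25, 35, 45, 12], "GENE_B": [30, 40, 50, 60]},
}
table = coverage_table(depth_by_sample)
print(table["GENE_B"])
print(genes_adequate_in_all(table, min_fraction=1.0))
```

Reporting the table per individual, rather than averaging across samples, makes it visible when a gene is missed in only some individuals.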

Whole genome data will eventually replace exome data as prices fall and the significance of conserved intragenic chromosomal regions is characterized [Margulies and Birney, 2008]. Genomic DNA will provide the genome-wide information we currently obtain from high-density SNP arrays. In addition, there is the potential for overall greater coverage of genes of interest, particularly non-coding regions that are missed by current exome HTS. Methods to filter conserved, non-coding regions of the genome are being developed, but are not yet widely available.

HTS represents a fundamental advance in how genomic sequence data are measured, and has fundamental implications for both research and clinical work. We believe that interpretation of exome data must involve clinicians and scientists familiar with the subjects of the study and should not rely solely on bioinformatics specialists. Using this model, individual researchers need to be empowered to work directly with exome data. An understanding of the fundamental underpinnings of HTS data manipulation and processing will provide the means for that empowerment.



We thank our patients and their families, who are partners in the pursuits of the NIH UDP. The clinical work to which we apply these methods would not be possible without the outstanding clinical nurse practitioners, research nurses, genetic counselors, consultants, and other providers with whom we work. We appreciate the excellent technical skills of Roxanne Fischer and Richard Hess.


  • Online Mendelian Inheritance in Man, OMIM®. McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University (Baltimore, MD), November 1st, 2011. World Wide Web URL:
  • Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, Bork P, Kondrashov AS, Sunyaev SR. 2010. A method and server for predicting damaging missense mutations. Nat Methods 7:248–249.
  • Bilgüvar K, Oztürk AK, Louvi A, Kwan KY, Choi M, Tatli B, Yalnizoğlu D, Tüysüz B, Cağlayan AO, Gökben S, Kaymakçalan H, Barak T, Bakircioğlu M, Yasuno K, Ho W, Sanders S, Zhu Y, Yilmaz S, Dinçer A, Johnson MH, Bronen RA, Koçer N, Per H, Mane S, Pamir MN, Yalçinkaya C, Kumandaş S, Topçu M, Ozmen M, Sestan N, Lifton RP, State MW, Günel M. 2010. Whole-exome sequencing identifies recessive WDR62 mutations in severe brain malformations. Nature 467:207–210.
  • Blankenberg D, Von Kuster G, Coraor N, Ananda G, Lazarus R, Mangan M, Nekrutenko A, Taylor J. 2010. Galaxy: a web-based genome analysis tool for experimentalists. Curr Protoc Mol Biol Chapter 19:Unit 19.10.1–21.
  • Bonnefond A, Durand E, Sand O, De Graeve F, Gallina S, Busiah K, Lobbens S, Simon A, Bellanné-Chantelot C, Létourneau L, Scharfmann R, Delplanque J, Sladek R, Polak M, Vaxillaire M, Froguel P. 2010. Molecular diagnosis of neonatal diabetes mellitus using next-generation sequencing of the whole exome. PLoS One 5:e13630, 1–5.
  • Brkanac Z, Spencer D, Shendure J, Robertson PD, Matsushita M, Vu T, Bird TD, Olson MV, Raskind WH. 2009. IFRD1 is a candidate gene for SMNA on chromosome 7q22-q23. Am J Hum Genet 84:692–697.
  • Bromberg Y, Rost B. 2007. SNAP: predict effect of non-synonymous polymorphisms on function. Nucleic Acids Res 35:3823–3835.
  • Bromberg Y, Rost B. 2008. Comprehensive in silico mutagenesis highlights functionally important residues in proteins. Bioinformatics 24:i207–i212.
  • Bromberg Y, Yachdav G, Rost B. 2008. SNAP predicts effect of mutations on protein function. Bioinformatics 24:2397–2398.
  • Brunak S, Engelbrecht J, Knudsen S. 1991. Prediction of human mRNA donor and acceptor sites from the DNA sequence. J Mol Biol 220:49–65.
  • Choi M, Scholl UI, Ji W, Liu T, Tikhonova IR, Zumbo P, Nayir A, Bakkaloğlu A, Ozen S, Sanjad S, Nelson-Williams C, Farhi A, Mane S, Lifton RP. 2009. Genetic diagnosis by whole exome capture and massively parallel DNA sequencing. Proc Natl Acad Sci USA 106:19096–19101.
  • Desmet FO, Hamroun D, Lalande M, Collod-Beroud G, Claustres M, Beroud C. 2009. Human Splicing Finder: an online bioinformatics tool to predict splicing signals. Nucleic Acids Res 37:e67.
  • Dias C, Sincan M, Rupps R, Briemberg H, Selby K, Mullikin J, Markello T, Adams D, Gahl WA, Boerkoel CF. 2011. Exome sequencing: diagnosis of genetically heterogeneous neuromuscular disorders. Hum Mutat 33. In Press.
  • Erlich Y, Edvardson S, Hodges E, Zenvirt S, Thekkat P, Shaag A, Dor T, Hannon GJ, Elpeleg O. 2011. Exome sequencing and disease-network analysis of a single family implicate a mutation in KIF1A in hereditary spastic paraparesis. Genome Res 21:658–664.
  • Feng S, Rubbi L, Jacobsen SE, Pellegrini M. 2011. Determining DNA methylation profiles using sequencing. Methods Mol Biol 733:223–238.
  • Fuentes Fajardo KV, Adams D, NISC Comparative Sequencing Program, Mason CE, Sincan M, Tifft C, Toro C, Boerkoel CF, Gahl W, Markello T. 2012. Detecting false positive signals in exome sequencing. Hum Mutat 33. In Press.
  • Gahl WA, Markello TC, Toro C, Fajardo KF, Sincan M, Gill F, Carlson-Donohoe H, Gropman A, Pierson TM, Golas G, Wolfe L, Groden C, Godfrey R, Nehrebecky M, Wahl C, Landis DM, Yang S, Madeo A, Mullikin JC, Boerkoel CF, Tifft CJ, Adams D. 2012. The National Institutes of Health Undiagnosed Diseases Program: insights into rare diseases. Genet Med 14(1):51–59.
  • Gahl WA, Tifft CJ. 2011. The NIH Undiagnosed Diseases Program: lessons learned. JAMA 305:1904–1905.
  • Giardine B, Riemer C, Hardison RC, Burhans R, Elnitski L, Shah P, Zhang Y, Blankenberg D, Albert I, Taylor J, Miller W, Kent WJ, Nekrutenko A. 2005. Galaxy: a platform for interactive large-scale genome analysis. Genome Res 15:1451–1455.
  • Goecks J, Nekrutenko A, Taylor J. 2010. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol 11:R86.
  • Gonzalez JR, Rodriguez-Santiago B, Caceres A, Pique-Regi R, Rothman N, Chanock SJ, Armengol L, Perez-Jurado LA. 2011. A fast and accurate method to detect allelic genomic imbalances underlying mosaic rearrangements using SNP array data. BMC Bioinformatics 12:166.
  • Hebsgaard SM, Korning PG, Tolstrup N, Engelbrecht J, Rouze P, Brunak S. 1996. Splice site prediction in Arabidopsis thaliana pre-mRNA by combining local and global sequence information. Nucleic Acids Res 24:3439–3452.
  • Henikoff S, Henikoff JG. 1992. Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA 89:10915–10919.
  • Hoischen A, van Bon BW, Gilissen C, Arts P, van Lier B, Steehouwer M, de Vries P, de Reuver R, Wieskamp N, Mortier G, Devriendt K, Amorim MZ, Revencu N, Kidd A, Barbosa M, Turner A, Smith J, Oley C, Henderson A, Hayes IM, Thompson EM, Brunner HG, de Vries BB, Veltman JA. 2010. De novo mutations of SETBP1 cause Schinzel-Giedion syndrome. Nat Genet 42:483–485.
  • Johnston JJ, Teer JK, Cherukuri PF, Hansen NF, Loftus SK, Chong K, Mullikin JC, Biesecker LG. 2010. Massively parallel sequencing of exons on the X chromosome identifies RBM10 as the gene that causes a syndromic form of cleft palate. Am J Hum Genet 86:743–748.
  • Kahrizi K, Hu CH, Garshasbi M, Abedini SS, Ghadami S, Kariminejad R, Ullmann R, Chen W, Ropers HH, Kuss AW, Najmabadi H, Tzschach A. 2011. Next generation sequencing in a family with autosomal recessive Kahrizi syndrome (OMIM 612713) reveals a homozygous frameshift mutation in SRD5A3. Eur J Hum Genet 19:115–117.
  • Kalay E, Yigit G, Aslan Y, Brown KE, Pohl E, Bicknell LS, Kayserili H, Li Y, Tüysüz B, Nürnberg G, Kiess W, Koegl M, Baessmann I, Buruk K, Toraman B, Kayipmaz S, Kul S, Ikbal M, Turner DJ, Taylor MS, Aerts J, Scott C, Milstein K, Dollfus H, Wieczorek D, Brunner HG, Hurles M, Jackson AP, Rauch A, Nürnberg P, Karagüzel A, Wollnik B. 2011. CEP152 is a genome maintenance protein disrupted in Seckel syndrome. Nat Genet 43:23–26.
  • Klein CJ, Botuyan MV, Wu Y, Ward CJ, Nicholson GA, Hammans S, Hojo K, Yamanishi H, Karpf AR, Wallace DC, Simon M, Lander C, Boardman LA, Cunningham JM, Smith GE, Litchy WJ, Boes B, Atkinson EJ, Middha S, B Dyck PJ, Parisi JE, Mer G, Smith DI, Dyck PJ. 2011. Mutations in DNMT1 cause hereditary sensory neuropathy with dementia and hearing loss. Nat Genet 43:595–600.
  • Krawitz PM, Schweiger MR, Rödelsperger C, Marcelis C, Kölsch U, Meisel C, Stephani F, Kinoshita T, Murakami Y, Bauer S, Isau M, Fischer A, Dahl A, Kerick M, Hecht J, Köhler S, Jäger M, Grünhagen J, de Condor BJ, Doelken S, Brunner HG, Meinecke P, Passarge E, Thompson MD, Cole DE, Horn D, Roscioli T, Mundlos S, Robinson PN. 2010. Identity-by-descent filtering of exome sequence data identifies PIGV mutations in hyperphosphatasia mental retardation syndrome. Nat Genet 42:827–829.
  • Lalonde E, Albrecht S, Ha KC, Jacob K, Bolduc N, Polychronakos C, Dechelotte P, Majewski J, Jabado N. 2010. Unexpected allelic heterogeneity and spectrum of mutations in Fowler syndrome revealed by next-generation exome sequencing. Hum Mutat 31:918–923.
  • Ledergerber C, Dessimoz C. 2011. Base-calling for next-generation sequencing platforms. Brief Bioinformatics 12:489–497.
  • Lin Y, Li J, Shen H, Zhang L, Papasian CJ, Deng HW. 2011. Comparative studies of de novo assembly tools for next-generation sequencing technologies. Bioinformatics 27:2031–2037.
  • Lupski JR, Reid JG, Gonzaga-Jauregui C, Rio Deiros D, Chen DC, Nazareth L, Bainbridge M, Dinh H, Jing C, Wheeler DA, McGuire AL, Zhang F, Stankiewicz P, Halperin JJ, Yang C, Gehman C, Guo D, Irikat RK, Tom W, Fantin NJ, Muzny DM, Gibbs RA. 2010. Whole-genome sequencing in a patient with Charcot-Marie-Tooth neuropathy. N Engl J Med 362:1181–1191.
  • Marchler-Bauer A, Anderson JB, DeWeese-Scott C, Fedorova ND, Geer LY, He S, Hurwitz DI, Jackson JD, Jacobs AR, Lanczycki CJ, Liebert CA, Liu C, Madej T, Marchler GH, Mazumder R, Nikolskaya AN, Panchenko AR, Rao BS, Shoemaker BA, Simonyan V, Song JS, Thiessen PA, Vasudevan S, Wang Y, Yamashita RA, Yin JJ, Bryant SH. 2003. CDD: a curated Entrez database of conserved domain alignments. Nucleic Acids Res 31:383–387.
  • Margulies EH, Birney E. 2008. Approaches to comparative sequence analysis: towards a functional view of vertebrate genomes. Nat Rev Genet 9:303–313.
  • Markello TC, Carlson-Donohoe H, Sincan M, Adams D, Bodine DM, Farrar JE, Vlachos A, Lipton JM, Auerbach AD, Ostrander EA, Chandrasekharappa SC, Boerkoel CF, Gahl WA. 2011. Sensitive quantification of mosaicism using high density SNP arrays and the cumulative distribution function. Mol Genet Metab [Epub ahead of print].
  • Markello TC, Han T, Carlson-Donohoe H, Ahaghotu C, Harper U, Jones M, Chandrasekharappa S, Anikster Y, Adams DR, Nisc Comparative Sequencing Program, Gahl WA, Boerkoel CF. 2011. Recombination mapping using Boolean logic and high-density SNP genotyping for exome sequence filtering. Mol Genet Metab [Epub ahead of print].
  • Matilla-Duenas A, Sanchez I, Corral-Juan M, Davalos A, Alvarez R, Latorre P. 2010. Cellular and molecular pathways triggering neurodegeneration in the spinocerebellar ataxias. Cerebellum 9:148–166.
  • McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA. 2010. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 20:1297–1303.
  • Mi H, Lazareva-Ulitsky B, Loo R, Kejariwal A, Vandergriff J, Rabkin S, Guo N, Muruganujan A, Doremieux O, Campbell MJ, Kitano H, Thomas PD. 2005. The PANTHER database of protein families, subfamilies, functions and pathways. Nucleic Acids Res 33(Database issue):D284–D288.
  • Miller JR, Koren S, Sutton G. 2010. Assembly algorithms for next-generation sequencing data. Genomics 95:315–327.
  • Ng PC, Henikoff S. 2001. Predicting deleterious amino acid substitutions. Genome Res 11:863–874.
  • Ng PC, Henikoff S. 2006. Predicting the effects of amino acid substitutions on protein function. Annu Rev Genomics Hum Genet 7:61–80.
  • Ng SB, Bigham AW, Buckingham KJ, Hannibal MC, McMillin MJ, Gildersleeve HI, Beck AE, Tabor HK, Cooper GM, Mefford HC, Lee C, Turner EH, Smith JD, Rieder MJ, Yoshiura K, Matsumoto N, Ohta T, Niikawa N, Nickerson DA, Bamshad MJ, Shendure J. 2010a. Exome sequencing identifies MLL2 mutations as a cause of Kabuki syndrome. Nat Genet 42:790–793.
  • Ng SB, Buckingham KJ, Lee C, Bigham AW, Tabor HK, Dent KM, Huff CD, Shannon PT, Jabs EW, Nickerson DA, Shendure J, Bamshad MJ. 2010b. Exome sequencing identifies the cause of a mendelian disorder. Nat Genet 42:30–35.
  • Nikopoulos K, Gilissen C, Hoischen A, van Nouhuys CE, Boonstra FN, Blokland EA, Arts P, Wieskamp N, Strom TM, Ayuso C, Tilanus MA, Bouwhuis S, Mukhopadhyay A, Scheffer H, Hoefsloot LH, Veltman JA, Cremers FP, Collin RW. 2010. Next-generation sequencing of a 40 Mb linkage interval reveals TSPAN12 mutations in patients with familial exudative vitreoretinopathy. Am J Hum Genet 86:240–247.
  • Pertea M, Lin X, Salzberg SL. 2001. GeneSplicer: a new computational method for splice site prediction. Nucleic Acids Res 29:1185–1190.
  • Pierson TM, Adams DA, Markello T, Golas F, Yang S, Sincan M, Simeonov DR, Fuentes-Fajardo K, Hansen NF, Cherukuri PF, Cruz P, Teer JK, Mullikin JC, Boerkoel CF, Gahl WA, Tifft CJ. 2011. Exome sequencing as a diagnostic tool in a case of undiagnosed juvenile-onset GM1-gangliosidosis. Neurology. In Press.
  • Puente XS, Quesada V, Osorio FG, Cabanillas R, Cadiñanos J, Fraile JM, Ordóñez GR, Puente DA, Gutiérrez-Fernández A, Fanjul-Fernández M, Lévy N, Freije JM, López-Otín C. 2011. Exome Sequencing and Functional Analysis Identifies BANF1 Mutation as the Cause of a Hereditary Progeroid Syndrome. Am J Hum Genet 88:650–656.
  • Rehman AU, Morell RJ, Belyantseva IA, Khan SY, Boger ET, Shahzad M, Ahmed ZM, Riazuddin S, Khan SN, Friedman TB. 2010. Targeted capture and next-generation sequencing identifies C9orf75, encoding taperin, as the mutated gene in nonsyndromic deafness DFNB79. Am J Hum Genet 86:378–388.
  • Rios J, Stein E, Shendure J, Hobbs HH, Cohen JC. 2010. Identification by whole-genome resequencing of gene defect responsible for severe hypercholesterolemia. Hum Mol Genet 19:4313–4318.
  • Roach JC, Glusman G, Smit AF, Huff CD, Hubley R, Shannon PT, Rowen L, Pant KP, Goodman N, Bamshad M, Shendure J, Drmanac R, Jorde LB, Hood L, Galas DJ. 2010. Analysis of genetic inheritance in a family quartet by whole-genome sequencing. Science 328:636–639.
  • Schatz MC, Delcher AL, Salzberg SL. 2010. Assembly of large genomes using second-generation sequencing. Genome Res 20:1165–1173.
  • Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K. 2001. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 29:308–311.
  • Simpson MA, Irving MD, Asilmaz E, Gray MJ, Dafou D, Elmslie FV, Mansour S, Holder SE, Brain CE, Burton BK, Kim KH, Pauli RM, Aftimos S, Stewart H, Kim CA, Holder-Espinasse M, Robertson SP, Drake WM, Trembath RC. 2011. Mutations in NOTCH2 cause Hajdu-Cheney syndrome, a disorder of severe and progressive bone loss. Nat Genet 43:303–305.
  • Sobreira NL, Cirulli ET, Avramopoulos D, Wohler E, Oswald GL, Stevens EL, Ge D, Shianna KV, Smith JP, Maia JM, Gumbs CE, Pevsner J, Thomas G, Valle D, Hoover-Fong JE, Goldstein DB. 2010. Whole-genome sequencing of a single proband together with linkage analysis identifies a Mendelian disease gene. PLoS Genet 6:e1000991.
  • Stenson PD, Mort M, Ball EV, Howells K, Phillips AD, Thomas NS, Cooper DN. 2009. The Human Gene Mutation Database: 2008 update. Genome Med 1(1):13.
  • Sudmant PH, Kitzman JO, Antonacci F, Alkan C, Malig M, Tsalenko A, Sampas N, Bruhn L, Shendure J, Eichler EE. 2010. Diversity of human copy number variation and multicopy genes. Science 330:641–646.
  • Summerer D, Schracke N, Wu H, Cheng Y, Bau S, Stahler CF, Stahler PF, Beier M. 2010. Targeted high throughput sequencing of a cancer-related exome subset by specific sequence capture with a fully automated microarray platform. Genomics 95:241–246.
  • Sincan M, Simeonov D, Adams D, Markello TC, Pierson T, Toro C, Gahl WA, Boerkoel CF. 2012. VAR-MD: A tool to analyze whole exome/genome variants in small human pedigrees with Mendelian inheritance. Hum Mutat 33.
  • Taylor J, Schenck I, Blankenberg D, Nekrutenko A. 2007. Using Galaxy to perform large-scale interactive data analyses. Curr Protoc Bioinformatics Chapter 10:Unit 10.5.
  • Teer JK, Bonnycastle LL, Chines PS, Hansen NF, Aoyama N, Swift AJ, Abaan HO, Albert TJ; NISC Comparative Sequencing Program, Margulies EH, Green ED, Collins FS, Mullikin JC, Biesecker LG. 2010. Systematic comparison of three genomic enrichment methods for massively parallel DNA sequencing. Genome Res 20:1420–1431.
  • Teer JK, Green ED, Mullikin JC, Biesecker LG. 2011. VarSifter: Visualizing and analyzing exome-scale sequence variation data on a desktop computer. Bioinformatics. Epub ahead of print.
  • Thomas PD, Campbell MJ, Kejariwal A, Mi H, Karlak B, Daverman R, Diemer K, Muruganujan A, Narechania A. 2003. PANTHER: a library of protein families and subfamilies indexed by function. Genome Res 13:2129–2141.
  • Venter M, Warnich L. 2009. In silico promoters: modelling of cis-regulatory context facilitates target prediction. J Cell Mol Med 13:270–278.
  • Volpi L, Roversi G, Colombo EA, Leijsten N, Concolino D, Calabria A, Mencarelli MA, Fimiani M, Macciardi F, Pfundt R, Schoenmakers EF, Larizza L. 2010. Targeted next-generation sequencing appoints c16orf57 as clericuzio-type poikiloderma with neutropenia gene. Am J Hum Genet 86:72–76.
  • Walsh T, Shahin H, Elkan-Miller T, Lee MK, Thornton AM, Roeb W, Abu Rayyan A, Loulus S, Avraham KB, King MC, Kanaan M. 2010. Whole exome sequencing and homozygosity mapping identify mutation in the cell polarity protein GPSM2 as the cause of nonsyndromic hearing loss DFNB82. Am J Hum Genet 87:90–94.
  • Wei X, Walia V, Lin JC, Teer JK, Prickett TD, Gartner J, Davis S; NISC Comparative Sequencing Program, Stemke-Hale K, Davies MA, Gershenwald JE, Robinson W, Robinson S, Rosenberg SA, Samuels Y. 2011. Exome sequencing identifies GRIN2A as frequently mutated in melanoma. Nat Genet 43:442–446.
  • Won HH, Kim HJ, Lee KA, Kim JW. 2008. Cataloging coding sequence variations in human genome databases. PLoS One 3:e3575.
  • Worthey EA, Mayer AN, Syverson GD, Helbling D, Bonacci BB, Decker B, Serpe JM, Dasu T, Tschannen MR, Veith RL, Basehore MJ, Broeckel U, Tomita-Mitchell A, Arca MJ, Casper JT, Margolis DA, Bick DP, Hessner MJ, Routes JM, Verbsky JW, Jacob HJ, Dimmock DP. 2011. Making a definitive diagnosis: successful clinical application of whole exome sequencing in a child with intractable inflammatory bowel disease. Genet Med 13:255–262.
  • Yandell M, Huff C, Hu H, Singleton M, Moore B, Xing J, Jorde LB, Reese MG. 2011. A probabilistic disease-gene finder for personal genomes. Genome Res 21:1529–1542.

Supporting Information


Additional supporting information may be found in the online version of this article.

humu_22035_sm_SuppInfo.pdf (43K): Supporting Information
