Keywords:

  • bioinformatics;
  • disease gene;
  • exome;
  • genome;
  • next-generation sequencing

Abstract

  1. Top of page
  2. Abstract
  3. Conflicts of interest
  4. Whole-exome sequencing
  5. WES strategies
  6. Intersection filtering
  7. De novo mutations
  8. Family-based filtering strategies: homozygosity mapping and linkage approaches
  9. IBD inference from WES/WGS data
  10. Conclusions
  11. Acknowledgement
  12. References

Robinson PN, Krawitz P, Mundlos S. Strategies for exome and genome sequence data analysis in disease-gene discovery projects.

In whole-exome sequencing (WES), target capture methods are used to enrich the sequences of the coding regions of genes from fragmented total genomic DNA, followed by massively parallel, ‘next-generation’ sequencing of the captured fragments. Since its introduction in 2009, WES has been successfully used in several disease-gene discovery projects, but the analysis of whole-exome sequence data can be challenging. In this overview, we present a summary of the main computational strategies that have been applied to identify novel disease genes in whole-exome data, including intersect filters, the search for de novo mutations, and the application of linkage mapping or inference of identity-by-descent (IBD) in family studies.


The identification of Mendelian disease genes has long been a major focus of human genetics. Until recently, most efforts at disease-gene identification involved positional cloning, that is, linkage analysis to identify a genomic interval usually spanning approximately 0.5–10 cM and containing up to about 300 genes. Sequencing large numbers of genes was time-consuming and expensive, and international efforts based primarily on positional cloning strategies had identified fewer than 2000 disease genes by the year 2009 (corresponding to somewhat fewer than 4000 diseases, because some genes are associated with multiple diseases). A large number of Mendelian diseases remain for which no disease gene has yet been identified, and presumably there are many more unnamed monogenic diseases that will be found in coming years.

The initial demonstration in 2009 by Sarah Ng and colleagues that whole-exome sequencing (WES) can be used to identify disease genes (1) can probably be regarded as the beginning of a revolution in human genetics, and many reports of novel disease genes discovered using WES have been published in the subsequent 2 years. However, WES is not a panacea, and researchers need to consider carefully how to design WES experiments for disease-gene discovery to avoid frustration. This article reviews the major strategies that have been applied to discover novel disease genes with WES and discusses some of the pitfalls of the methodology.

Whole-exome sequencing

To date, the great majority of mutations identified in human hereditary diseases have been located in the coding sequences of genes. It seems possible that mutations in non-coding sequences are more common than is currently appreciated and have been rarely detected because of various technical and experimental biases. However, the fact that the great majority of disease-causing mutations characterized to date have been located in or around exons strongly suggested that it would be useful in disease-gene discovery projects to concentrate sequencing efforts on the approximately 1% of the human genome that codes for protein sequences to avoid the additional cost and complexity of whole-genome sequencing (WGS). However, as the costs of WGS continue to fall and our ability to interpret variation in non-coding sequences improves, it seems likely that WGS will replace WES in many settings.

Current methods for enriching exonic sequences all work in principle in the same fashion. Oligonucleotide probes are constructed to hybridize (to “capture”) the target sequences from fragmented total genomic DNA. Common linkers or adaptors are used as primers to amplify the target sequences in a single PCR reaction and the unwanted sequences are discarded. A number of companies are offering target capture methods for WES (2). The methods typically aim to capture all exonic and flanking sequences and may also include probes to target microRNA and other sequences of interest. Several reviews on the technical aspects of target capture methods have appeared recently (3–5). Current commercial offerings are compatible with the three major next-generation sequencing platforms by Illumina, Roche, and Applied Biosystems. The kits tend to comprise exons from the consensus coding sequence project (6), which currently comprises 176,266 exons from 18,409 genes, as well as additional sequences.

WES strategies

One of the main challenges for disease-gene discovery by WES lies in the sheer number of variants found in individual exomes. It has been reported that each genome carries 165 homozygous protein-truncating or stop loss variants in genes representing a diverse set of pathways (7). This means that the mere finding of a sequence variant that appears to be a pathogenic mutation cannot be taken as proof that the change is causally related to the disease being investigated, and integrative computational analysis that takes into account phenotypes, prioritizes sequence variants, and makes use of information from multiple databases is required to use the data from WES in a medical context. In addition, the choice of which samples to sequence and what type of bioinformatic analysis to apply will depend on the clinical situation. The following sections intend to provide an intuitive introduction to the relevant issues and pointers to the literature.

Intersection filtering

Although the numbers vary between publications, presumably reflecting differences in technologies and analysis strategies, an individual exome is typically found to have 20,000–30,000 variants compared with the genomic reference sequence. Up to roughly 10,000 of these variants are predicted to lead to non-synonymous amino acid substitutions (missense mutations) or alterations of conserved splice site residues, or represent small insertions or deletions (NS/SS/I). Depending on the ethnic background of the proband and other factors, up to about 90% of these variants can be found in databases of common variants such as dbSNP (8), the 1000 Genomes Project (9), and in-house exome databases. Based on the assumption that variants that are common in the population are not likely to be the cause of rare Mendelian diseases, such variants are typically filtered out before further analysis. Similarly, variants that are computationally predicted to be benign are typically removed from further analysis based on the results of algorithms that estimate the pathogenicity of missense and other variants (10–12). It should be noted that computational algorithms have high rates of false-positive and false-negative predictions (13, 14). Although it is difficult to give an exact numerical value, it is likely that the false-negative and false-positive rates are at least 20% for WES data.
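The filtering steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Variant` record, the consequence labels, and the 1% frequency cut-off are hypothetical choices made for this sketch and are not taken from any of the cited pipelines.

```python
from dataclasses import dataclass

# Hypothetical consequence labels standing in for the NS/SS/I class.
NS_SS_I = {"missense", "nonsense", "splice_site", "indel"}


@dataclass(frozen=True)
class Variant:
    gene: str
    chrom: str
    pos: int
    ref: str
    alt: str
    consequence: str        # e.g. "missense" or "synonymous"
    population_freq: float  # frequency in a reference database, 0.0 if absent
    predicted_benign: bool  # verdict of some pathogenicity predictor


def filter_candidates(variants, max_freq=0.01):
    """Keep rare NS/SS/I variants that are not predicted to be benign.

    max_freq is an assumed threshold; a stricter filter would drop every
    variant that appears in the reference database at all.
    """
    return [
        v for v in variants
        if v.consequence in NS_SS_I
        and v.population_freq <= max_freq
        and not v.predicted_benign
    ]
```

Because the pathogenicity predictions are error-prone, a filter of this kind is best treated as a prioritization step; variants it removes may need to be revisited if no candidate gene emerges.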

This kind of filtering has been used successfully in several projects in which multiple individuals with a given disease were sequenced. Following removal of common variants and those not predicted to be pathogenic, only those genes are considered that show rare and potentially pathogenic sequence variants in all (or most) sequenced individuals (1, 15–17). For autosomal dominant disorders, each candidate gene must show at least one such change per individual, and for autosomal recessive disorders, candidate genes must have either homozygous or compound heterozygous mutations.
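A minimal sketch of such an intersection filter is shown below. It assumes that each exome has already been reduced to a list of `(gene, zygosity)` pairs for its rare, potentially pathogenic variants; the function names and the `min_fraction` parameter are illustrative. Two heterozygous variants in the same gene are counted as a possible compound heterozygote, although without phase information they could lie on the same haplotype.

```python
from collections import Counter


def genes_for_individual(variants, mode):
    """Return the genes of one individual that fit the inheritance model.

    variants: iterable of (gene, zygosity), zygosity being "het" or "hom".
    mode: "dominant" (at least one variant per gene) or "recessive"
          (one homozygous or at least two heterozygous variants per gene).
    """
    het_counts = Counter()
    hom_genes = set()
    for gene, zygosity in variants:
        if zygosity == "hom":
            hom_genes.add(gene)
        else:
            het_counts[gene] += 1
    if mode == "dominant":
        return hom_genes | set(het_counts)
    if mode == "recessive":
        return hom_genes | {g for g, n in het_counts.items() if n >= 2}
    raise ValueError("mode must be 'dominant' or 'recessive'")


def intersect_candidates(exomes, mode, min_fraction=1.0):
    """Genes that qualify in at least min_fraction of the sequenced exomes."""
    counts = Counter()
    for variants in exomes:
        counts.update(genes_for_individual(variants, mode))
    needed = min_fraction * len(exomes)
    return {gene for gene, n in counts.items() if n >= needed}
```

Setting `min_fraction` below 1.0 corresponds to requiring variants in most rather than all individuals, which gives some tolerance for poorly covered exons and genetic heterogeneity.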

The assumption is that each exome or genome contains numerous sequence variants that are unrelated to the disease being studied and that are not removed by the filtering steps described above (thus, these variants can be regarded as false-positive calls). Under the assumption that these variants are distributed at random in the population, if we examine the intersection between a sufficient number of unrelated individuals, only the disease gene itself will show mutations in all individuals. Imagine for the sake of argument that, in any given individual, 5% of 20,000 targeted genes show rare, potentially pathogenic sequence variants. If we sequence a single individual, then 1000 genes will remain as candidates after we filter as above. If we sequence a second individual and examine only those genes with variants in both individuals, then 5% of 1000 or 50 candidates will remain. After a third individual, 5% of 50 or about 2.5 candidates are expected to remain by chance, and after a fourth, less than one gene is predicted to have a variant in all individuals just by chance, so that only the true disease gene should remain.
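The arithmetic of this thought experiment can be written out explicitly. Under the independence assumption, the expected number of genes that carry a qualifying variant by chance in every one of k individuals is the number of targeted genes multiplied by the k-th power of the per-individual fraction. The function names below are illustrative.

```python
def expected_chance_candidates(n_genes, fraction, n_individuals):
    """Expected number of genes with a chance variant in all individuals."""
    return n_genes * fraction ** n_individuals


def individuals_needed(n_genes, fraction):
    """Smallest number of individuals for which the expectation drops below 1."""
    if not 0 < fraction < 1:
        raise ValueError("fraction must lie strictly between 0 and 1")
    k = 0
    while expected_chance_candidates(n_genes, fraction, k) >= 1:
        k += 1
    return k
```

Calling `individuals_needed(20000, 0.05)` returns the number of unrelated individuals required with the figures used in the text. The estimate is optimistic, because real variants are not spread uniformly over genes: long or highly polymorphic genes survive the intersection more often than the model predicts.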

Of course, this strategy is highly susceptible to false-negative and false-positive results if applied naively. For instance, in a study on 10 individuals with Kabuki syndrome, the only gene that was found to have at least one NS/SS/I in all 10 sequenced individuals was the MUC16 gene, which codes for a protein with 22,152 amino acids that provides a protective, lubricating barrier against particles and infectious agents at mucosal surfaces. It soon became apparent that this was a false-positive result that might be related to the extremely large size of the coding sequence and the resultant higher chance of an unrelated sequence variant being present. On the other hand, it is possible that a mutation is located in a poorly covered exon and thus escapes detection (typically, a reasonable coverage can be achieved for up to about 90% of the sequenced exome using current targeting technologies; thus, if several of the mutations among the sequenced individuals are located in poorly covered exons, the candidate gene would falsely be removed from further consideration). Alternatively, a mutation might not be a typical NS/SS/I variant and thus might have been mistakenly removed. For instance, a mutation such as c.6354C>T, a silent mutation in exon 51 of the fibrillin-1 gene that induces exon skipping (18), would not be identified by current filtering strategies. In addition, although most point mutations in human hereditary disease identified to date have been located in or near exons, point mutations in distant enhancers and other regulatory elements have been associated with hereditary diseases (19), and such mutations would for the most part not be detectable using current enrichment strategies. Finally, genetically heterogeneous disorders can be missed by this approach, because different genes could be involved in individual patients of the study group.

De novo mutations

Many consultations in medical genetics clinics deal with isolated cases of mental retardation, multiple congenital anomalies, or other diseases. Unless an aetiological diagnosis can be made, it is not formally possible to know whether the manifestations in the patient are related to an autosomal recessive disorder, an oligogenic or otherwise multifactorial disease, environmental factors, or to a de novo (spontaneous) mutation. Recent results suggest that the role of de novo mutations in such situations may have been underappreciated. The per-generation mutation rate in humans has been estimated at between 7.6 × 10⁻⁹ and 2.2 × 10⁻⁸, or roughly one in a hundred million positions in the haploid genome, which corresponds to 0.86 de novo amino acid altering mutations per newborn (20, 21). Therefore, for diseases with a high degree of genetic heterogeneity, such as intellectual disability, it seemed reasonable to hypothesize that de novo mutations might be more common than previously believed and to use an analysis strategy in which case-parent trios are sequenced to identify potentially pathogenic, de novo changes in the exome sequences of the affected children (22, 23).

The pioneering work of Vissers and colleagues on this topic describes how exome sequencing was performed in 10 trios (affected child and healthy parents). After ruling out copy number variations by array CGH analysis, exome sequences were obtained and subjected to a bioinformatic pipeline to exclude common variants and those predicted not to be pathogenic. Then, variants were sought that were present only in an affected child but not in the parents. This led to the identification of convincing candidate mutations in 7 of the 10 trios (22).
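The core of the trio comparison is a set difference, sketched below. Variants are represented as `(chrom, pos, ref, alt)` tuples, which is an assumption of this sketch. The child's set is meant to be the filtered candidate list, whereas the parental sets should be the unfiltered calls, so that an inherited variant is not mistaken for a de novo one. The optional depth check is likewise an assumption rather than part of the published pipeline: it guards against apparent de novo calls at positions where a parent was too poorly covered to detect the variant.

```python
def de_novo_candidates(child_variants, mother_variants, father_variants,
                       parental_depth=None, min_depth=10):
    """Return the child's variants that are absent from both parents.

    child_variants, mother_variants, father_variants: sets of
        (chrom, pos, ref, alt) tuples.
    parental_depth: optional dict mapping a variant to the lower of the two
        parental read depths at that position (hypothetical input).
    min_depth: assumed minimum parental depth to trust an absence call.
    """
    candidates = child_variants - (mother_variants | father_variants)
    if parental_depth is not None:
        candidates = {
            v for v in candidates
            if parental_depth.get(v, 0) >= min_depth
        }
    return sorted(candidates)
```

Candidates found in this way would normally still be validated by an independent method such as Sanger sequencing of all three family members.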

Family-based filtering strategies: homozygosity mapping and linkage approaches

Pierce and coworkers examined a non-consanguineous family in which two sisters had Perrault syndrome, a recessive disorder characterized by ovarian dysgenesis in females, sensorineural deafness, and neurological manifestations. WES of only one of the sisters revealed exactly one gene with two rare variants both predicted to be pathogenic: HSD17B4, which encodes 17β-hydroxysteroid dehydrogenase type 4 (24). However, the experience of most labs involved in WES projects suggests that it is extremely difficult to narrow down the search to exactly one gene based on only a single exome sequence. For this reason, classical approaches such as homozygosity mapping (25) and linkage analysis (26) have been used to exclude irrelevant parts of the exome or genome prior to the application of other computational filters.

For instance, Bilgüvar and colleagues examined a small consanguineous kindred with two sibs having microcephaly and other brain malformations. Whole-genome genotyping was used to identify shared homozygous segments that together made up 80 cM. The analysis of WES data concentrated on these regions, and a novel frameshift mutation in WDR62 was identified. To confirm WDR62 as the disease gene, Sanger sequencing was used to identify WDR62 mutations in other kindreds (27). Similar approaches using homozygosity mapping or linkage analysis to narrow down the candidate regions have been used successfully in a number of WES and related studies to identify or confirm novel disease genes (28–35).
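Restricting the exome variants to previously mapped regions amounts to an interval lookup, as in the following sketch. The input layout is an assumption: regions are `(chrom, start, end)` tuples with inclusive coordinates and are taken to be non-overlapping within a chromosome, and variants are tuples whose first two fields are chromosome and position.

```python
from bisect import bisect_right


def build_region_index(regions):
    """Group (chrom, start, end) regions by chromosome and sort them."""
    index = {}
    for chrom, start, end in regions:
        index.setdefault(chrom, []).append((start, end))
    for intervals in index.values():
        intervals.sort()
    return index


def in_regions(index, chrom, pos):
    """True if pos lies inside one of the indexed regions on chrom."""
    intervals = index.get(chrom, [])
    # Locate the last interval whose start is <= pos.
    i = bisect_right(intervals, (pos, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= pos <= intervals[i][1]


def restrict_to_regions(variants, regions):
    """Keep only variants located in the given linkage or homozygous regions."""
    index = build_region_index(regions)
    return [v for v in variants if in_regions(index, v[0], v[1])]
```

In a consanguineous family the same step would typically be combined with a requirement that the remaining variants are homozygous in the affected individuals.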

IBD inference from WES/WGS data

Roach and coworkers performed an analysis of genetic inheritance in a family quartet by WGS, using a Hidden Markov Model (HMM) to model the Mendelian inheritance states at each reference position. There are four possible states of inheritance, depending on whether the two children shared alleles from both parents, from only the mother or only the father, or shared none. Both children in this family had two recessive disorders, Miller syndrome and primary ciliary dyskinesia, as had previously been characterized by exome sequencing (16). The inheritance states in the family quartet as inferred by the HMM were observed in large contiguous blocks, which allowed the number of candidate genes for both of these Mendelian disorders to be narrowed down to only four (21).
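To make the four inheritance states concrete, the sketch below enumerates, for a single site, which states are compatible with the unphased genotypes of a quartet. It is not the published HMM: it covers only the per-site step, and the state labels and function name are chosen here for illustration. An HMM would then combine such per-site evidence along the chromosome into contiguous blocks.

```python
from itertools import product


def compatible_states(father, mother, child1, child2):
    """Inheritance states consistent with the genotypes at one site.

    Each genotype is an unphased pair of alleles, e.g. ("A", "G").
    Returned labels: "both" (children share the paternal and the maternal
    allele), "paternal" (only the paternal allele is shared), "maternal"
    (only the maternal allele is shared), and "none".
    An empty set indicates a Mendelian inconsistency or genotyping error.
    """
    c1 = tuple(sorted(child1))
    c2 = tuple(sorted(child2))
    states = set()
    # i, j: paternal and maternal allele index passed to child 1;
    # k, l: the same for child 2.
    for i, j, k, l in product((0, 1), repeat=4):
        if tuple(sorted((father[i], mother[j]))) != c1:
            continue
        if tuple(sorted((father[k], mother[l]))) != c2:
            continue
        if i == k and j == l:
            states.add("both")
        elif i == k:
            states.add("paternal")
        elif j == l:
            states.add("maternal")
        else:
            states.add("none")
    return states
```

Most sites are uninformative and are compatible with several states; only sites at which a parent is heterozygous help to discriminate between them.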

The authors of this review developed an HMM-based algorithm to infer chromosomal regions that are IBD based only on the (potentially noisy) exome sequences of the affected siblings. In consanguineous families, affected individuals share two IBD haplotypes inherited from a single common ancestor [homozygosity-by-descent (HBD)]. The disease gene must be located somewhere within the HBD haplotype block, which is the basis of homozygosity mapping (25). In the general case in which the parents are not consanguineous, each affected sibling inherits the same haplotype from each parent. Such chromosomal regions are referred to as IBD = 2 (Fig. 1).

Figure 1. In autosomal recessive disorders, the disease gene must be located in a chromosomal region in which the paternal and maternal haplotypes are both identical-by-descent (IBD = 2).

Our algorithm uses a non-homogeneous HMM that employs local recombination rates to identify IBD = 2 chromosomal regions in children of consanguineous or non-consanguineous parents solely based on genotype data of siblings derived from high-throughput sequencing platforms. Inference of IBD = 2 regions can be used to identify the chromosomal regions that are compatible with the inheritance patterns of a recessive monogenic disorder and can be combined with previous methods for filtering out common variants and for predicting potentially pathogenic sequence changes as described above. This approach was first successfully used in identifying PIGV as the disease gene in Hyperphosphatasia-Mental Retardation Syndrome (36, 37).
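The general idea of segmenting a chromosome into IBD = 2 and other regions can be illustrated with a deliberately simplified two-state HMM decoded by the Viterbi algorithm. This is a toy sketch and not the published algorithm: the observation at each site is reduced to whether the two siblings have the same genotype, and all emission probabilities are assumed values. The per-interval switch probabilities stand in for the position-dependent transitions of a non-homogeneous model and would in practice be derived from local recombination rates. The prior of 0.25 reflects the expected fraction of the genome that full siblings share IBD = 2.

```python
import math


def viterbi_ibd2(concordant, switch_probs, p_discord_ibd2=0.01,
                 p_discord_other=0.3, prior_ibd2=0.25):
    """Most likely IBD = 2 (True) or other (False) state at each site.

    concordant: list of bools, True where the sibling genotypes agree.
    switch_probs: probability of a state change between consecutive sites,
        one value per interval, each strictly between 0 and 1.
    p_discord_ibd2: assumed discordance rate inside IBD = 2 (genotyping error).
    p_discord_other: assumed discordance rate outside IBD = 2.
    """
    n = len(concordant)
    if n == 0:
        return []
    if len(switch_probs) != n - 1:
        raise ValueError("need one switch probability per interval")

    def emit(state, obs):
        # state 1 is IBD = 2, state 0 is any other sharing state
        p_dis = p_discord_ibd2 if state == 1 else p_discord_other
        return math.log(1 - p_dis if obs else p_dis)

    score = [math.log(1 - prior_ibd2) + emit(0, concordant[0]),
             math.log(prior_ibd2) + emit(1, concordant[0])]
    back = []
    for t in range(1, n):
        stay = math.log(1 - switch_probs[t - 1])
        switch = math.log(switch_probs[t - 1])
        new_score, pointers = [], []
        for state in (0, 1):
            from_same = score[state] + stay
            from_other = score[1 - state] + switch
            if from_same >= from_other:
                best, prev = from_same, state
            else:
                best, prev = from_other, 1 - state
            new_score.append(best + emit(state, concordant[t]))
            pointers.append(prev)
        score = new_score
        back.append(pointers)

    # Trace back from the better final state.
    state = 1 if score[1] > score[0] else 0
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [s == 1 for s in path]
```

With a small switch probability, an isolated discordant call inside a long concordant stretch is absorbed as a genotyping error, whereas a run of discordant sites ends the IBD = 2 segment. Variants surviving the earlier filters can then be restricted to the segments labelled True.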

Conclusions

It has been less than 2 years since the first publication by Sarah Ng and coworkers on the application of WES for disease-gene identification (1), but it already seems clear that the field of human genetics has entered a new era, in which we can hope to quickly elucidate the molecular basis of most remaining Mendelian disorders and to radically improve our ability to perform timely and accurate diagnostics for persons with rare diseases. The trend towards cheaper and higher throughput DNA sequencing has been proceeding substantially more rapidly than predicted even by ‘Moore’s law’ for computing hardware, according to which the number of transistors on a chip roughly doubles every 2 years. The primary challenge in diagnostics in human genetics is likely to shift from the mere identification of sequence variants to the interpretation of the variants, and bioinformatics will play a key role at all levels of data analysis and interpretation. Currently, laboratories involved in many WES projects, including our own, report success rates in identifying novel disease genes of at most about 50%. There are many potential reasons for failure in WES projects, some of which have been mentioned in this short overview. As we move forward, a number of things will be needed to achieve the full promise of WES for disease-gene discovery and later on for routine diagnostics (38).

Improvements of the software used for the analysis of WES/WGS data are sorely needed and continued algorithmic development will be required as target capture technologies continue to evolve. As soon as WGS becomes economically feasible, algorithms for reliably capturing structural variation (39) and for interpreting variants in non-coding conserved sequences will become essential.

Standard protocols and ontologies will be required for interoperability of databases; a major problem in studying rare diseases is that one can never be entirely certain that a given gene is in fact the sought after disease gene until a second unrelated individual or family is described with a mutation in the same gene and a comparable phenotype. Standard ontologies for describing the human phenotype (40–42) combined with other standards for reporting mutations, classifying diseases, and storing the genotypes of WES or WGS data will be an essential component to allow communication and interoperability. At present, there is no comprehensive database of human mutations and phenotypes that can be used for the interpretation of WES/WGS data, although efforts are underway in the community. Such a database could be used to connect groups at different locations each of which has identified single individuals or families with mutations in a novel candidate for a rare disease and thus help to accelerate efforts to identify the remaining disease genes in the human genome.

References
