Keywords:

  • next-generation sequencing;
  • exome sequencing;
  • next-generation sequencing analysis

ABSTRACT

  1. Top of page
  2. ABSTRACT
  3. Introduction
  4. Methods
  5. Results
  6. Discussion
  7. Acknowledgment
  8. References
  9. Supporting Information

Novel genes are now identified at a rapid pace for many Mendelian disorders, and increasingly, for genetically complex phenotypes. However, new challenges have also become evident: (1) effectively managing larger exome and/or genome datasets, especially for smaller labs; (2) direct hands-on analysis and contextual interpretation of variant data in large genomic datasets; and (3) many small and medium-sized clinical and research-based investigative teams around the world are generating data that, if combined and shared, will significantly increase the opportunities for the entire community to identify new genes. To address these challenges, we have developed GEnomes Management Application (GEM.app), a software tool to annotate, manage, visualize, and analyze large genomic datasets (https://genomics.med.miami.edu/). GEM.app currently contains ∼1,600 whole exomes from 50 different phenotypes studied by 40 principal investigators from 15 different countries. The focus of GEM.app is on user-friendly analysis for nonbioinformaticians to make next-generation sequencing data directly accessible. Yet, GEM.app provides powerful and flexible filter options, including single family filtering, across family/phenotype queries, nested filtering, and evaluation of segregation in families. In addition, the system is fast, obtaining results within 4 sec across ∼1,200 exomes. We believe that this system will further enhance identification of genetic causes of human disease.


Introduction

The development of next-generation sequencing (NGS) technologies has revolutionized human genetics research [Biesecker, 2010]. Whole exome sequencing (WES), an early genomic application, is a rapid, high-throughput, and cost-effective approach and has been widely used to identify pathogenic variation, especially in Mendelian disorders [Ng et al., 2010; Velinov et al., 2012]. In addition, the recent validation of an excess of rare variation (<0.5% minor allele frequency) in the human species provides further support for the hypothesis that rare changes potentially explain a significant portion of inheritance in so-called complex human diseases [Tennessen et al., 2012]. Such considerations have further stimulated the large-scale production of genomic datasets, leading to significant challenges for efficient data management and analysis. The data-intensive nature and lengthy computational pipelines for genomic data have also increasingly removed clinically and molecularly trained investigators from direct access to data analysis. This leads to missed opportunities in identifying novel causative gene variants and creates unnecessary bottlenecks in the discovery process. Finally, large disease-oriented research consortia and collaborative networks of investigators are seeking new ways for communal data analysis and the sharing of variant data.

To address these concerns, a number of tools have been developed to analyze and visualize genomic variant data. Although these tools are very useful, each is limited in specific ways. VAAST [Yandell et al., 2011] is a powerful tool to identify genes likely involved in disease, but is not intended for browsing variant and annotation data. VarSifter [Teer et al., 2012] was developed to visualize variant data, but is designed for a desktop computer and can presently manage only a modest amount of data. Lastly, the Sequence Variant Analyzer (SVA) [Ge et al., 2011] includes many powerful analysis packages, but is rather complex and does not easily facilitate collaborative efforts. To address these limitations, we have developed GEnomes Management Application (GEM.app), an analysis toolset accessible via modern Web browsers that allows for easy, quick, and collaborative analysis of genomic data. GEM.app is currently growing significantly in size, and we are observing the identification of disease genes at a pace of more than one per month [Gonzalez et al., in press; Martin et al., 2013; Montenegro et al., 2012; Tesson et al., 2012; Velinov et al., 2012].

Methods

Bioinformatics

The Illumina CASAVA v1.8 pipeline was used to produce 100 bp sequence reads. BWA software [Li and Durbin, 2010] was used to align sequence reads to the human genome (hg19), and variants were called using the GATK v1.4 software package [DePristo et al., 2011; McKenna et al., 2010]. Variants were submitted to SeattleSeq for annotation. Further annotation was obtained using data from dbSNP137, variant frequency data from the NHLBI Exome Sequencing Project (ESP) Exome Variant Server, Seattle, WA [Exome Variant Server, 2012], the Human Gene Mutation Database (HGMD) [Stenson et al., 2012], and the Online Mendelian Inheritance in Man (OMIM) database (OMIM Online Mendelian Inheritance in Man, December 2012).

GEDI Pipeline and GEM.app GUI

GEnome Data Import (GEDI) is an automated pipeline that uses available software tools (VCFtools and ENSEMBL VEP) and Perl scripts to automate processing, annotation, and backfilling of VCF files, as well as data upload into our GEM.app database (MySQL 5.5). GEDI is optimized to extract relevant information for each variant and to transform and structure the genomic information to guarantee fast query execution using the GEM.app GUI. The GEM.app GUI was implemented in layers to allow enough flexibility to handle data efficiently in a fast-growing environment. To execute queries, APIs were developed using PHP 5.3. GEM.app's user-friendly interface was built with jQuery, HTML5, JSON, and CSS3. SlickGrid was used to display query results.

Results

Basic Principles and Description of GEM.app

Several design principles went into the development of GEM.app. We created a graphical user interface for this Web application that makes genomic data accessible to physician scientists and molecularly trained PhDs (Fig. 1). The easy-to-use interface allows a user to design custom queries, and results are typically returned in seconds. This is a key feature, as it allows for an iterative working approach, instant refining of filters, and immediate testing of different Mendelian segregation patterns, even if hundreds of exomes are being queried. GEM.app is a Web application developed using current Internet standards and libraries (jQuery, JSON, HTML5, and SlickGrid) and is compatible with Safari 5.1.7, Chrome 22.0.1229.94, and Firefox 16.0.2 or later versions. We have developed a powerful access-control system that assigns an account to each user. This account system is context-aware and shows or hides phenotype-specific customized filter options and variant annotation. Each user sees only their own detailed data; yet, anonymous variant counts derived from the entire database are visible to all users. Examples include global minor allele and genotype frequencies in GEM.app, the number of familial segregation events of a specific variant under a selected Mendelian trait, or the number of SNVs/indels per gene of interest. These latter features encourage collaboration for studies that are focused on rare Mendelian-type variants, as they give hints on the existence of a second family for a given new candidate gene. The access system further allows for sharing of access to specific exomes and for collaborative analysis. Examples of existing successful collaborations in GEM.app include the international Inherited Neuropathy Consortium with >250 exomes in GEM.app and a network of 15 collaborating groups in as many countries working on hereditary spastic paraplegia (HSP) (>400 exomes).
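The split between private detailed data and database-wide anonymous counts can be illustrated with a minimal Python sketch. It is not the GEM.app implementation (which uses PHP and MySQL); the `Genotype` record, the `owner` field, the `shared_with` mapping, and the sample data are all hypothetical, and the frequency shown is an alternate allele frequency computed under the simplifying assumption of diploid autosomal genotypes.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Genotype:
    sample_id: str    # de-identified sample identifier
    owner: str        # account that holds the sample (hypothetical field)
    variant: str      # e.g. "chr1:100:A>G"
    alt_alleles: int  # 0, 1, or 2 copies of the alternate allele


def visible_genotypes(genotypes, user, shared_with=None):
    """Return only the rows a user owns or has been granted access to."""
    shared_with = shared_with or {}
    allowed_owners = {user} | shared_with.get(user, set())
    return [g for g in genotypes if g.owner in allowed_owners]


def anonymous_allele_frequencies(genotypes):
    """Alternate allele frequency per variant over ALL samples, without sample details."""
    alt = Counter()
    total = Counter()
    for g in genotypes:
        alt[g.variant] += g.alt_alleles
        total[g.variant] += 2  # assumption: diploid, two alleles per genotyped sample
    return {variant: alt[variant] / total[variant] for variant in total}


# Hypothetical data: two labs, four samples, one variant.
DATA = [
    Genotype("S1", "lab_a", "chr1:100:A>G", 1),
    Genotype("S2", "lab_a", "chr1:100:A>G", 0),
    Genotype("S3", "lab_b", "chr1:100:A>G", 1),
    Genotype("S4", "lab_b", "chr1:100:A>G", 0),
]

own_rows = visible_genotypes(DATA, "lab_a")             # detailed rows for lab_a only
global_freqs = anonymous_allele_frequencies(DATA)       # aggregate visible to everyone
```

In this sketch, `lab_a` sees detailed rows only for S1 and S2, while the aggregate frequency still reflects the carrier in `lab_b`, which is the kind of hint toward a second family that the anonymous counts are meant to provide. Passing `shared_with={"lab_a": {"lab_b"}}` models a collaboration in which `lab_b` has shared its exomes with `lab_a`.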

Figure 1. GEM.app pipeline and graphical user interface. A: Starting page with currently seven analysis modules. B: Example of “Variants within families” filter module. There are at least 13 context-specific different filter categories available. Preset filters autofill a number of variables and allow for a “three click” search. C: Result screen of a GEM.app query. Different control options are detailed.

Data Security

We only store deidentified data in GEM.app. This includes a numerical identifier for each sample and family, sex, and deidentified pedigrees. Access is password controlled and all data transfer between users and servers is encrypted by a VeriSign class 3 server certificate, which is comparable to online banking security. Further, GEM.app currently resides on servers of the University of Miami, which are behind a firewall and monitored 24/7 for cyber attacks. In the future, we envision moving GEM.app to a true cloud environment.

GEDI Pipeline

Part of GEM.app is a processing pipeline, the GEDI module, which handles processing of VCF files, annotation, “backfilling” of variant data, cosegregation analysis within families, and calculation of counts across all samples, including minor allele frequencies (Supp. Fig. S1). When data are processed and imported into GEM.app, the following information is required for each sample: affection status, pedigree individual ID, family ID, and possible Mendelian inheritance patterns. This information allows GEDI to automatically determine whether a specific variant follows a given segregation model (e.g., autosomal dominant). If a user selects a particular inheritance pattern to analyze, GEM.app will return variants that fit this model in a table format (Fig. 1C). GEDI achieves annotation by utilizing the SeattleSeq annotation server (http://snp.gs.washington.edu/), which includes conservation and amino acid substitution scores (GERP, PhastCons, Grantham, PolyPhen2) [Adzhubei et al., 2010; Davydov et al., 2010; Siepel et al., 2005]. In addition, GEDI obtains data from dbSNP (http://www.ncbi.nlm.nih.gov/projects/SNP/), NHLBI EVS (http://evs.gs.washington.edu/EVS/), OMIM (http://www.ncbi.nlm.nih.gov/omim), String-db (http://string-db.org/), and ENSEMBL (http://www.ensembl.org/). Backfilling is a process that retrieves sequence information from all previously added VCF files for each novel variant incorporated into GEM.app. Each addition of new exomes initiates a reanalysis of variant counts, recalculation of allele and genotype frequencies, and so on. The GEDI process is fully automated and takes advantage of the 5,000-node computer cluster named Pegasus at the University of Miami.
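As an illustration of how a segregation check of this kind might work, the following Python sketch tests whether a variant is consistent with a simple inheritance model within one family. It is not the GEDI code: the function names, the model labels, and the example family are hypothetical, and the logic assumes full penetrance, no phenocopies, and (for the recessive model) homozygous variants only, ignoring compound heterozygosity and sex-linked traits.

```python
def segregates(genotypes, affected, model):
    """Check one variant against one family.

    genotypes: {individual_id: alternate allele count (0, 1, or 2)}
    affected:  {individual_id: True if affected, False if unaffected}
    model:     "autosomal_dominant" or "autosomal_recessive" (hypothetical labels)
    """
    for individual, is_affected in affected.items():
        count = genotypes.get(individual)
        if count is None:
            # Assumption: a missing call counts as a failure; backfilling
            # is meant to make this case rare.
            return False
        if model == "autosomal_dominant":
            carries_risk_genotype = count >= 1
        elif model == "autosomal_recessive":
            carries_risk_genotype = count == 2
        else:
            raise ValueError(f"unknown model: {model}")
        # Full penetrance, no phenocopies: risk genotype if and only if affected.
        if carries_risk_genotype != is_affected:
            return False
    return True


def variants_fitting_model(family_variants, affected, model):
    """Return the variants (in input order) that segregate under the model."""
    return [
        variant
        for variant, genotypes in family_variants.items()
        if segregates(genotypes, affected, model)
    ]


# Hypothetical family: affected father and first child.
AFFECTED = {"father": True, "mother": False, "child1": True, "child2": False}
FAMILY_VARIANTS = {
    # Carried only by affected members: consistent with autosomal dominant.
    "chr2:500:C>T": {"father": 1, "mother": 0, "child1": 1, "child2": 0},
    # Also carried by the unaffected mother: rejected under both models.
    "chr3:900:G>A": {"father": 1, "mother": 1, "child1": 1, "child2": 0},
}

dominant_hits = variants_fitting_model(FAMILY_VARIANTS, AFFECTED, "autosomal_dominant")
```

A real pipeline would likely need to relax these assumptions (reduced penetrance, uncertain affection status, low-quality calls), which is one reason to treat this as a sketch of the idea and not a specification.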

Graphical User Interface

After the GEDI/GEM.app pipeline is completed, data are accessible to registered users through an online graphical user interface (https://genomics.med.miami.edu). Users interact with query modules, which are presented as tiles with descriptive names. Currently seven different modules are available (Fig. 1A). These include simple modules (“Quick finds”) that search for genes (“Gene look-up”) or genomic positions (“Position look-up”). Further, all accessible exomes are individually listed (“My samples”), complete with basic phenotypes, external and internal numbering, quality measures from alignment and variant calling, possible traits, processing details, and other information. Most queries happen in extended filtering modules. “Variants within families” provides detailed options for variant filtering within single families, but can process as many families at once as requested. Fifteen different filter option fields are present, including selection of genomic positions, variant function class (synonymous, nonsynonymous, etc.), conservation scores, quality scores, Mendelian traits, or lists of known genes for phenotypic groups (e.g., inherited peripheral neuropathy genes) (Fig. 1B). The module “Genes across families” identifies genes that carry qualifying variants in multiple families and/or phenotypes. Advanced filter modules contain two-step or nested queries. In a first step, filtering with a strict set of criteria yields a small number of hits; in a second step, the resulting genes are queried with more relaxed criteria to find additional evidence for a gene in a larger set of exomes. All filter modules have the option of choosing from four preset filter criteria: user-defined, relaxed, moderate, or strict (Fig. 1).

The output is presented in a table that contains 28 different columns with annotation. Each row presents a variant. There are an additional 44 columns of annotation that can be added via a “column manager” (Supp. Fig. S2). Columns can also be sorted and rearranged via drag and drop. Several fields in the output table contain hyperlinks that directly link to pedigrees, to the UCSC genome browser (http://genome.ucsc.edu/), a gene-network viewer (http://string-db.org), NCBI (http://www.ncbi.nlm.nih.gov/), or OMIM (http://omim.org/) (Supp. Fig. S2). GEM.app connects several publicly available databases directly to data being analyzed within the same Web page, thus making follow-up analysis less tedious (Fig. 1C).

Performance

Queries of individual families are typically finished in less than 1 sec. Benchmarking of more than 10,000 queries by over 60 different users demonstrates that all query modules achieve an average query time of less than 8 sec, with the exception of the two-step module (Fig. 2). The most popular query module, “Variants within families,” produces output in ∼4 sec. Factors that negatively influence query times include: (1) querying across multiple phenotypes, (2) using fewer filter criteria, and (3) using nested filter options. Generally, the speed and simplicity of GEM.app allow an investigator to iteratively test different filtering strategies (such as multiple possible traits) within a few minutes without the need for programming. Searching across a large number of samples/families is fast: a query for conserved and rare mutations in 124 known genes for related neurodegenerative diseases (Charcot-Marie-Tooth [CMT], HSP, distal hereditary motor neuron, etc.) across 481 exomes obtained results within 10 sec. Querying these same 124 genes across 1,200 samples finished within 20 sec.

Figure 2. Performance of GEM.app. Average search times over 10,000 queries from over 60 different users. By far, the most popular module is “Variants within families,” which returns results in ∼4 sec across 1,200 exomes. Individual families are typically instantly returned.

Currently, 103 users are registered to use GEM.app, including 40 principal investigators from 15 different countries studying over 50 different phenotypes (Fig. 3A). Since the release of GEM.app in April 2012, we have experienced a significant increase in the number of users and queries (Fig. 3B and C).

Figure 3. Current usage of GEM.app. A: Geographical overview of principal investigators with data in GEM.app. B: The number of registered users has grown to >103 in the past 6 months. C: The usage of GEM.app has increased significantly since its release. ASHG, American Society of Human Genetics annual meeting.

Discussion

The development of GEM.app was motivated by the need for an NGS analysis tool for physician scientists and biomedical investigators with limited computational experience. Importantly, we needed a tool to manage and organize large sets of exome data as they become available from large-scale sequencing projects. Increasing sample sizes are required to identify genes for rare, highly heterogeneous Mendelian disorders and rare familial forms of phenotypes with a complex genetic architecture. The modular structure of GEM.app allows for future implementation of new computational strategies, such as multicore processing and multithreading, to address the increasing size of data and scaling to whole genomes. Because investigators worldwide are producing small to large exome/genome datasets, the exchange of variant information and uniform analysis provide an often untapped opportunity for increasing genetic power via collaborations. A variety of collaborative models, from loose networks to tightly integrated consortia, are feasible within the flexible access system in GEM.app. Existing projects range from direct shared access to full exomes at different sites to use of the anonymous variant counts across all datasets, which are available to every registered user. Further, consortia can decide to limit data analysis to specific sites, or, potentially more advantageous, data analysis can be spread over a larger number of institutions. The intuitive interface allows for the participation of investigators with different skill sets, including expert geneticists, physician scientists, or basic scientists.

The GEM.app framework has recently been utilized to identify clinically relevant variants in a number of disorders, such as inherited deafness, CMT disease, HSP, and dilated cardiomyopathies [Diaz-Horta et al., 2012; McCorquodale et al., 2011; Montenegro et al., 2011; Montenegro et al., 2012; Norton et al., 2012]. GEM.app has also been applied to identify novel genes [Martin et al., 2013; Montenegro et al., 2012; Osterloh et al., 2012; Sirmaci et al., 2011; Tesson et al., 2012; Velinov et al., 2012]. The majority of these studies are led by physician scientists interested in the application of WES for the elucidation of genetic variation involved in their disorders of interest, thus demonstrating the power of GEM.app to connect investigators to their NGS datasets. Some of the newer findings were only possible because relatively small datasets from around the world were connected in this centralized resource and because individual investigators were able to share their results on a trusted platform.

In summary, GEM.app will enable researchers of all computational backgrounds to visualize and analyze genomic variant data. Using an automated pipeline, GEM.app organizes thoroughly annotated variant data, connects them directly to various genomics resources, and allows investigators to directly analyze and interpret mutations from large sets of samples. GEM.app offers biomedical researchers the ability to share data and perform joint analysis simultaneously. These features promote collaborations leading to the identification of novel disease-associated genes.

Acknowledgment

We are thankful to Yamil Velez for helpful discussions about details of data mining.

References

  • Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, Bork P, Kondrashov AS, Sunyaev SR. 2010. A method and server for predicting damaging missense mutations. Nat Methods 7:248–249.
  • Biesecker LG. 2010. Exome sequencing makes medical genomics a reality. Nat Genet 42:13–14.
  • Davydov EV, Goode DL, Sirota M, Cooper GM, Sidow A, Batzoglou S. 2010. Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLoS Comput Biol 6:e1001025.
  • DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell TJ, et al. 2011. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 43:491–498.
  • Diaz-Horta O, Duman D, Foster J 2nd, Sırmacı A, Gonzalez M, Mahdieh N, Fotouhi N, Bonyadi M, Cengiz FB, Menendez I, Ulloa RH, Edwards YJ, Züchner S, Blanton S, Tekin M. 2012. Whole-exome sequencing efficiently detects rare mutations in autosomal recessive nonsyndromic hearing loss. PLoS ONE 7:e50628.
  • Exome Variant Server. 2012. NHLBI exome sequencing project (ESP).
  • Ge D, Ruzzo EK, Shianna KV, He M, Pelak K, Heinzen EL, Need AC, Cirulli ET, Maia JM, Dickson SP, Zhu M, Singh A, Allen AS, Goldstein DB. 2011. SVA: software for annotating and visualizing sequenced human genomes. Bioinformatics 27:1998–2000.
  • Gonzalez M, Nampoothiri S, Kornblum C, Oteyza AC, Walter J, Konidari I, Hulme W, Speziani F, Schöls L, Züchner S, Schüle R. 2013. Mutations in phospholipase DDHD2 cause autosomal recessive hereditary spastic paraplegia (SPG54). Eur J Hum Genet. [Epub ahead of print].
  • Li H, Durbin R. 2010. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26:589–595.
  • Martin E, Schule R, Smets K, Rastetter A, Boukhris A, Loureiro JL, Gonzalez MA, Mundwiller E, Deconinck T, Wessner M, Jornea L, Caballero Oteyza AC, et al. 2013. Loss of function of glucocerebrosidase GBA2 is responsible for motor neuron defects in hereditary spastic paraplegia. Am J Hum Genet 92:238–244.
  • McCorquodale DS 3rd, Montenegro G, Peguero A, Carlson N, Speziani F, Price J, Taylor SW, Melanson M, Vance JM, Zuchner S. 2011. Mutation screening of mitofusin 2 in Charcot-Marie-Tooth disease type 2. J Neurol 258:1234–1239.
  • McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA. 2010. The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 20:1297–1303.
  • Montenegro G, Powell E, Huang J, Edwards YJK, Beecham G, Hulme W, Siskind C, Vance J, Shy M, Züchner S. 2011. Exome sequencing allows for rapid gene identification in a Charcot-Marie-Tooth disease family. Ann Neurol 3:464–470.
  • Montenegro G, Rebelo AP, Connell J, Allison R, Babalini C, D'Aloia M, Montieri P, Schule R, Ishiura H, Price J, Strickland A, Gonzalez MA, et al. 2012. Mutations in the ER-shaping protein reticulon 2 cause the axon-degenerative disorder hereditary spastic paraplegia type 12. J Clin Invest 122:538–544.
  • Ng SB, Buckingham KJ, Lee C, Bigham AW, Tabor HK, Dent KM, Huff CD, Shannon PT, Jabs EW, Nickerson DA, Shendure J, Bamshad MJ. 2010. Exome sequencing identifies the cause of a Mendelian disorder. Nat Genet 42:30–35.
  • Norton N, Robertson PD, Rieder MJ, Zuchner S, Rampersaud E, Martin E, Li D, Nickerson DA, Hershberger RE, on behalf of the National Heart, Lung and Blood Institute GO Exome Sequencing Project. 2012. Evaluating pathogenicity of rare variants from dilated cardiomyopathy in the exome era. Circ Cardiovasc Genet 5:167–174.
  • OMIM Online Mendelian Inheritance in Man. 2012. An online catalog of human genes and genetic disorders. Baltimore, MD: Johns Hopkins University.
  • Osterloh JM, Yang J, Rooney TM, Fox AN, Adalbert R, Powell EH, Sheehan AE, Avery MA, Hackett R, Logan MA, MacDonald JM, Ziegenfuss JS, et al. 2012. dSarm/Sarm1 is required for activation of an injury-induced axon death pathway. Science 337:481–484.
  • Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S, Weinstock GM, Wilson RK, et al. 2005. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res 15:1034–1050.
  • Sirmaci A, Spiliopoulos M, Brancati F, Powell E, Duman D, Abrams A, Bademci G, Agolini E, Guo S, Konuk B, Kavaz A, Blanton S, et al. 2011. Mutations in ANKRD11 cause KBG syndrome, characterized by intellectual disability, skeletal malformations, and macrodontia. Am J Hum Genet 89:289–294.
  • Stenson PD, Ball EV, Mort M, Phillips AD, Shaw K, Cooper DN. 2012. The human gene mutation database (HGMD) and its exploitation in the fields of personalized genomics and molecular evolution. Curr Protoc Bioinform. Chapter 1:Unit1.13.
  • Teer JK, Green ED, Mullikin JC, Biesecker LG. 2012. VarSifter: visualizing and analyzing exome-scale sequence variation data on a desktop computer. Bioinformatics 28:599–600.
  • Tennessen JA, Bigham AW, O'Connor TD, Fu W, Kenny EE, Gravel S, McGee S, Do R, Liu X, Jun G, Kang HM, Jordan D, et al. 2012. Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science 337:64–69.
  • Tesson C, Nawara M, Salih MA, Rossignol R, Zaki MS, Al Balwi M, Schule R, Mignot C, Obre E, Bouhouche A, Santorelli FM, Durand CM, et al. 2012. Alteration of fatty-acid-metabolizing enzymes affects mitochondrial form and function in hereditary spastic paraplegia. Am J Hum Genet 91:1051–1064.
  • Velinov M, Dolzhanskaya N, Gonzalez M, Powell E, Konidari I, Hulme W, Staropoli JF, Xin W, Wen GY, Barone R, Coppel SH, Sims K, Brown WT, Zuchner S. 2012. Mutations in the gene DNAJC5 cause autosomal dominant Kufs disease in a proportion of cases: study of the Parry family and 8 other families. PLoS ONE 7:e29729.
  • Yandell M, Huff C, Hu H, Singleton M, Moore B, Xing J, Jorde LB, Reese MG. 2011. A probabilistic disease-gene finder for personal genomes. Genome Res 21:1529–1542.

Supporting Information

Disclaimer: Supplementary materials have been peer-reviewed but not copyedited.

Filename                    Format   Size   Description
humu22305-sup-0001-si.pdf            142K   Supplementary Information

Please note: Wiley Blackwell is not responsible for the content or functionality of any supporting information supplied by the authors. Any queries (other than missing content) should be directed to the corresponding author for the article.