Integration of global resources for human genetic variation and disease†
For the Deep Phenotyping Special Issue
There is an increasing accumulation of data on disease-related mutations and associated phenotypes in a wide variety of databases worldwide. Exploiting these data in the context of whole-genome sequencing is inhibited because the phenotype information in these databases is often difficult to search meaningfully or relate between data sets, and automated computational integration is not possible. Key to this integration is the development of ontology-based methods for describing diseases in terms of their component phenotypes. This would allow the analysis of variation in disease manifestation, the identification of relationships between diseases and phenotypes in model organisms, and the linking of diseases to gene mutations, pathways, and phenotypes. Building a systematic link to phenotypes manifested in model organisms will be of particular importance with the advent of new, large-scale phenotyping projects such as the International Mouse Phenotyping Consortium. In addition to improved semantic description, funding and organizational innovations are required to support this integration. In particular, a series of national or international hubs is needed to hold genotype and phenotype data and to feed these data to a central database. Better coordination of clinical and bioinformatics experts and, crucially, development of a transnational funding and international coordination infrastructure will also be required. Hum Mutat 33:813–816, 2012. © 2012 Wiley Periodicals, Inc.
Data on Human Variation and Disease is Scattered Across the World
It is now 45 years since McKusick (2007) published Mendelian Inheritance in Man, the first attempt to systematically assemble the corpus of knowledge about the relations between human genetic variation and the phenotypes or diseases to which it gives rise. Since then, the data have been hugely augmented and are now available as an online knowledge base [Amberger et al., 2011], which contains information on all known Mendelian disorders and more than 12,000 genes. This tour de force of data gathering and analysis represents the state of the art with respect to our knowledge of the effects of single gene mutations on the human phenotype.
More complex common diseases, often with multigene or multifactorial etiology, account for the predominant worldwide burden of genetically ascribable morbidity and mortality. These have been, and continue to be, investigated through, for example, genome-wide association studies (GWAS), which use the framework provided by the Human Genome Project and data from its successors, such as the 1000 Genomes Project [1000 Genomes Consortium, 2010], to identify associations between genetic polymorphisms in the population and disease. Data from GWAS are disseminated through multiple knowledge bases throughout the world, such as the Genetic Association Database, the Database of Genotypes and Phenotypes (dbGaP) [Mailman et al., 2007], and GWAS Central [Thorisson et al., 2009a]. These data sets overlap significantly, but the databases do not systematically share data.
Finally, we have a proliferation of locus-specific databases (LSDBs), which concentrate on one or a small number of genes of relevance to one or a group of diseases of interest. These often have rich phenotypic data but much less coverage of the genome in contrast to association study databases, which have genome-wide coverage but generally rather shallow phenotype data. More than 4,100 LSDBs are now listed at the Gen2Phen Knowledge Centre Website (www.gen2phen.org/data/lsdbs) [Webb et al., 2011].
What is striking about the location and accessibility of data connecting human genetic variation with phenotypes is that they are scattered around the world in multiple databases, knowledge bases (where the primary data are not searchable), and flat files on servers [Patrinos and Brookes, 2005; Thorisson et al., 2009b]. Phenotypic information in particular is often recorded in such a way that it is extremely difficult to search for and retrieve, and nigh impossible to integrate computationally across multiple sources without a great deal of manual effort. Many databases do not permit programmatic access to data and have to be searched manually, generally through an HTML interface, and may require registration before use.
The dispersed nature of data for clinical genetics is largely a consequence of the history of the discipline, but we are now entering a phase where we can build on the framework provided by the genome projects and use informatics to integrate all these data and analyze them genome-wide. To do this, we need both standardization and a sustainable infrastructure to aggregate and analyze the data.
Need for Semantic Standardization of Phenotypic and Disease Descriptions
We have had standard terminologies used for clinical management and billing, epidemiology, and summary diagnosis for some time, notably the International Classification of Diseases and Systematized Nomenclature of Medicine—Clinical Terms [International Health Terminology Standards Development Organisation, 2011; World Health Organization, 2004]. More recently, a need for the development of formal description frameworks for human disease based on ontologies, describing diseases in terms of their associated observable phenotypes, has been outlined [Schofield et al., 2007; Schofield et al., 2010]. There are several major advantages in breaking down the description of disease into its phenotypic components. For example,
- 1. It allows for the analysis of variation in disease manifestation, time course, and therapeutic response.
- 2. It allows relationships to other diseases and phenotypes in model organisms to be established where there is only partial but significant overlap of symptoms.
- 3. It allows relationships between gene function and phenotype to be established where one aspect of the phenotype is common to several genes on the same pathway.
- 4. It allows relationships between pathways and gene networks underlying different diseases to be uncovered.
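As a minimal sketch of the second point, a disease or model organism phenotype can be represented as a set of phenotype terms, and partial overlap can then be scored with a simple set similarity. The term identifiers (`PHENO:0001` and so on), the profile names, and the choice of the Jaccard index are illustrative assumptions made here, not identifiers or methods taken from any of the resources discussed; published approaches generally use ontology-aware similarity measures rather than plain set overlap.

```python
# Illustrative sketch only: all term IDs and profile names are hypothetical.

def jaccard(a, b):
    """Jaccard index of two term collections: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Each disease (or mouse model) is described by its component phenotypes.
PROFILES = {
    "disease_A": {"PHENO:0001", "PHENO:0002", "PHENO:0003"},
    "disease_B": {"PHENO:0002", "PHENO:0003", "PHENO:0004"},
    "mouse_model_X": {"PHENO:0003", "PHENO:0005"},
}


def rank_similar(query, profiles):
    """Rank all other profiles by phenotypic overlap with the query profile."""
    query_terms = profiles[query]
    scores = [
        (name, jaccard(query_terms, terms))
        for name, terms in profiles.items()
        if name != query
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, score in rank_similar("disease_A", PROFILES):
        print(f"{name}\t{score:.2f}")
```

With these toy profiles, `disease_B` shares two of four terms with `disease_A` and so ranks above `mouse_model_X`, which shares one of four. A whole-disease label would have shown no relationship at all between the three entries.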
Such an approach would complement new attempts to classify diseases based on their underlying molecular biology [Committee on a Framework for Developing a New Taxonomy of Disease, National Research Council, 2011] and has been greatly facilitated by the development of the human phenotype ontology (HPO) [Robinson et al., 2008], which has been or is now being adopted by a number of databases and resources such as the Database of Chromosomal Imbalance and Phenotype in Humans Using Ensembl Resources [Firth et al., 2009], the National Center for Biotechnology Information (NCBI) Genetic Testing Registry [Javitt et al., 2010] and ClinVar database (www.ncbi.nlm.nih.gov/clinvar), and the Wellcome Trust Deciphering Developmental Disorders Project [Firth and Wright, 2011]. HPO provides a human counterpart for the established mammalian phenotype ontology [Smith et al., 2005], which is heavily used by many model organism databases. Using this formalism, diseases can be linked to model organism phenotypes by linking phenotype ontologies across species.
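The cross-species linkage described here can be sketched, under strong simplifying assumptions, as a lookup from human phenotype terms to equivalent mouse phenotype terms, followed by a search of mouse gene annotations. The identifiers (`HPX:…`, `MPX:…`), the gene names, and the one-to-one mapping table are all hypothetical placeholders; in practice the links between ontologies are derived from logical definitions and are not a flat dictionary.

```python
# Illustrative sketch only: identifiers, genes, and mappings are hypothetical.

# Assumed equivalences between human and mouse phenotype terms.
HUMAN_TO_MOUSE = {
    "HPX:0001": "MPX:0101",
    "HPX:0002": "MPX:0102",
    # "HPX:0003" deliberately has no mouse counterpart in this toy mapping.
}

# Assumed mouse gene-to-phenotype annotations.
MOUSE_GENE_ANNOTATIONS = {
    "gene_a": {"MPX:0101", "MPX:0102"},
    "gene_b": {"MPX:0102", "MPX:0200"},
    "gene_c": {"MPX:0300"},
}


def translate(human_terms, mapping):
    """Map human phenotype terms to mouse terms, skipping unmapped terms."""
    return {mapping[term] for term in human_terms if term in mapping}


def rank_mouse_genes(disease_terms, mapping, gene_annotations):
    """Rank mouse genes by the number of disease phenotypes they reproduce."""
    mouse_terms = translate(disease_terms, mapping)
    hits = []
    for gene, terms in gene_annotations.items():
        shared = mouse_terms & terms
        if shared:
            hits.append((gene, sorted(shared)))
    # Most shared terms first; ties broken alphabetically by gene name.
    return sorted(hits, key=lambda hit: (-len(hit[1]), hit[0]))


if __name__ == "__main__":
    disease = {"HPX:0001", "HPX:0002", "HPX:0003"}
    for gene, shared in rank_mouse_genes(
        disease, HUMAN_TO_MOUSE, MOUSE_GENE_ANNOTATIONS
    ):
        print(gene, shared)
```

The ranking step is deliberately naive (a count of shared terms); it is meant only to show where a shared formalism makes the comparison possible at all.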
Oti et al. (2009) have examined the coherence of phenotype data in several major phenome databases including Online Mendelian Inheritance in Man (OMIM), and conclude that aggregation of data from several databases is needed to allow the full potential of these data to be realized, and that this in turn depends on the availability of a fine-grained phenotype ontology and feature frequency data.
Adoption of the same phenotype ontologies permits integration of data between resources as well as computational analysis. This means that even if data are located in different databases, they can be pooled and analyzed together. As well as standardizing phenotypic descriptions it is necessary to agree on syntactic standards so that databases and analytical tools can understand the structure of the datasets they are given. Initial efforts have been made in this area referring both to gene and phenotype data [Swertz et al., 2010] and to bioscience data more generally [Sansone et al., 2012].
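To illustrate the distinction between semantic and syntactic standards, the following sketch assumes two databases that already annotate with the same ontology term identifiers but store their records under different field names. The field names, gene symbols, and term identifiers are hypothetical. Once each record is converted to a common structure, pooling becomes a simple aggregation.

```python
# Illustrative sketch only: field names, genes, and term IDs are hypothetical.
from collections import defaultdict

# Two sources sharing term IDs (semantic agreement) but not record layout.
DB_ONE = [
    {"gene_symbol": "GENE1", "pheno_id": "PHENO:0002"},
    {"gene_symbol": "GENE3", "pheno_id": "PHENO:0007"},
]
DB_TWO = [
    {"locus": "GENE2", "term": "PHENO:0002"},
    {"locus": "GENE1", "term": "PHENO:0002"},
]


def from_db_one(record):
    """Convert a DB_ONE record to the common (gene, phenotype) structure."""
    return {"gene": record["gene_symbol"], "phenotype": record["pheno_id"]}


def from_db_two(record):
    """Convert a DB_TWO record to the common (gene, phenotype) structure."""
    return {"gene": record["locus"], "phenotype": record["term"]}


def pool(records):
    """Group genes by phenotype term across all normalized records."""
    index = defaultdict(set)
    for record in records:
        index[record["phenotype"]].add(record["gene"])
    return dict(index)


if __name__ == "__main__":
    normalized = [from_db_one(r) for r in DB_ONE] + [from_db_two(r) for r in DB_TWO]
    for term, genes in sorted(pool(normalized).items()):
        print(term, sorted(genes))
```

The per-database converters are exactly the code that an agreed exchange format would make unnecessary; the shared term identifiers are what make the final grouping meaningful.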
Widespread adoption of such standards for phenotype description could also add value to data collected by the biobanking and patient-facing clinical communities. Biobanks collect material and phenotype information on individuals for longitudinal studies. An integrated infrastructure for biobanks including the potential to share information is being established under the auspices of the European Commission (www.bbmri.eu/). By facilitating the linkage of such data sets, particularly with the likely emergence of individual genome sequencing, there is potential to add considerable value to gene and pathway identification studies and, in the slightly longer term, to feed into systems biology models useful in emerging areas such as personal genomics and individualized therapy [Shublaq, 2012a]. In the clinic, electronic health record systems are still under development and progress is patchy. However, it is to be expected that robust and generalizable systems will emerge over the next few years, and integrating these with standard phenotype descriptions would serve both as an aid in the clinic itself by allowing access to a broader set of information about individual phenotypes and diseases, and for the research community by providing access to more information about associations between phenotypes, treatment models, etc. Again, pan-European strategies in this area are under development [Shublaq, 2012b].
The requirement for semantic and syntactic standardization has been extensively considered for model organism databases, particularly for the mouse [Smedley et al., 2008], and recommendations have been made for the development and adoption of international standards. Similar global recommendations have been made under the ELIXIR infrastructure project funded by the European Commission (http://www.elixir-europe.org/).
Integration and Analysis of Human Genotype/Phenotype Information with Model Organism Data
Standardization of human phenotypic data to the HPO permits human data to be integrated and coanalyzed with the huge amount of data available from model organisms. The amount of data available for the mouse, for example, is currently very large. The Mouse Genome Database [Blake et al., 2011] currently contains data on 15,522 genes with mutant alleles in mice and there is phenotypic information on circa 8,200 of the mouse's approximately 25,000 protein-coding genes, providing an extremely rich resource that can be accessed to help understand and interpret human genetic data. This will soon be augmented by the recently initiated International Mouse Phenotyping Consortium (www.mousephenotype.org), which has begun the phenotyping of mouse strains with mutations in all 25,000 protein-coding genes in the mouse genome [Abbott, 2010; Skarnes et al., 2011; Brown and Moore, 2012]. More than $150 million has already been committed to the first stage of this project by the National Institutes of Health, Bethesda, Maryland; Genome Canada, Ottawa, ON, Canada; the Wellcome Trust, London, UK; and the UK Medical Research Council, London, UK.
Exploitation of model organism phenotype and genotype data can be extremely helpful for human genetics and functional genomics. We already have some examples and proofs of principle of how model organism and human data can be used together [Chen et al., 2012; Espinosa and Hancock, 2011; Hoehndorf et al., 2011; Webber et al., 2009], but it is not yet possible to fully exploit all the data available for the reasons discussed above. Areas likely to benefit from this include:
- Functional pathway analysis/exploration/discovery.
- Functional validation of human GWAS.
- Reduction of the level of proof needed for confirmation of disease association.
- Candidate gene prioritization within implicated loci.
- Dissection and validation of disease/phenotype associations with copy number variation mutations in humans.
- Identification of mouse models of human disease, with particular importance for rare and orphan diseases.
- Identification of pathogenicity of mutations.
Need for a Financially and Scientifically Sustainable Transnational Infrastructure of Databases and Coordination
With the globalization of research, data need to be made available for sharing across national boundaries. This challenges traditional models of scientific funding, where one nation funds infrastructure and resources for its own scientists. There has been increasingly sophisticated discussion of how such transnational resources might be made available through international cofunding efforts [Swertz et al., 2010], particularly for the model organism databases [International Arabidopsis Informatics Consortium, 2010], and legal models for transnational funding have been proposed by the European Commission. There are some impressive examples whereby transnational funding and international coordination of research strategy and infrastructure have produced excellent results and value for money, but astonishingly there are few examples of such coordinated activity in human genetics.
Priorities for the Global Coordination of Human Genotype/Phenotype Resources
Investment is needed in developing the necessary ontologies for interoperability. Primarily, these are HPO, the phenotype and trait ontology, the mammalian pathology ontology, and the gene ontology, but the greatest challenge remains the development of an ontology of disease that links diseases with their constituent phenotypes.
Approaches and algorithms for integrating and analyzing phenotype data within and between species need to be developed. Proofs of principle have been demonstrated [e.g., Chen et al., 2012; Espinosa and Hancock, 2011; Gkoutos et al., 2009; Hoehndorf et al., 2011], but these methods need to be developed in parallel with the ontologies themselves.
A formal computable model to describe diseases in terms of their associated phenotype terms needs to be developed.
Syntactic standards for phenotypic and genotypic data exchange need to be developed and agreed by the stakeholders.
A series of national or international hubs to hold genotype and phenotype data need to be set up into which data can be deposited. These hubs may then share data either dynamically in a federated fashion or through periodic data migration to a central database.
Improved coordination of clinical and bioinformatics experts is needed.
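The two ways in which hubs might share data, dynamic federation or periodic migration to a central database, can be contrasted in a small in-memory sketch. The `Hub` class, the hub names, and the record fields are hypothetical and stand in for real databases and network services; nothing here reflects the design of an existing system.

```python
# Illustrative sketch only: the Hub class, names, and records are hypothetical.

class Hub:
    """A stand-in for a national or international genotype/phenotype hub."""

    def __init__(self, name, records):
        self.name = name
        self._records = list(records)

    def query(self, phenotype):
        """Return all records annotated with the given phenotype term."""
        return [r for r in self._records if r["phenotype"] == phenotype]

    def export(self):
        """Return a copy of all records, as for a periodic data dump."""
        return list(self._records)


def federated_query(hubs, phenotype):
    """Option 1: query every hub at request time and merge the answers."""
    results = []
    for hub in hubs:
        for record in hub.query(phenotype):
            results.append({**record, "source": hub.name})
    return results


def migrate_to_central(hubs):
    """Option 2: copy all hub records into one central store."""
    merged = [
        {**record, "source": hub.name}
        for hub in hubs
        for record in hub.export()
    ]
    return Hub("central", merged)


HUBS = [
    Hub("hub_one", [{"gene": "GENE1", "phenotype": "PHENO:0002"}]),
    Hub("hub_two", [
        {"gene": "GENE2", "phenotype": "PHENO:0002"},
        {"gene": "GENE3", "phenotype": "PHENO:0007"},
    ]),
]

if __name__ == "__main__":
    print(federated_query(HUBS, "PHENO:0002"))
    print(migrate_to_central(HUBS).query("PHENO:0002"))
```

In this toy setting both routes return the same records. The practical difference is that a federated query always reflects the current state of each hub, whereas a central copy is only as fresh as the last migration but does not depend on every hub being reachable at query time.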
While meeting these suggested priorities will go a long way toward facilitating data integration and analysis, the advent of next-generation sequencing techniques, the sequencing of whole exomes, and the collection of deep phenotype data for precision medicine will soon generate unprecedented amounts of highly complex data. Without the necessary informatics and database infrastructure in place, we will be unable to realize the maximum value of these data. Organization and coordination will be the key to success, and we look forward to a new phase of global cooperation in the mobilization of human genotype and phenotype data.