Getting Ready for the Human Phenome Project: The 2012 Forum of the Human Variome Project

Correspondence to: William S. Oetting MMC 485, 420 Delaware St. S.E. University of Minnesota, Minneapolis, MN 55455. E-mail: oetti001@umn.edu

ABSTRACT

A forum of the Human Variome Project (HVP) was held as a satellite to the 2012 Annual Meeting of the American Society of Human Genetics in San Francisco, California. The theme of this meeting was “Getting Ready for the Human Phenome Project.” Understanding the genetic contribution to both rare single-gene “Mendelian” disorders and more complex common diseases will require integration of research efforts among many fields and better defined phenotypes. The HVP is dedicated to bringing together researchers and research populations throughout the world to provide the resources to investigate the impact of genetic variation on disease. To this end, there needs to be a greater sharing of phenotype and genotype data. For this to occur, many databases that currently exist will need to become interoperable to allow for the combining of cohorts with similar phenotypes to increase statistical power for studies attempting to identify novel disease genes or causative genetic variants. Improved systems and tools that enhance the collection of phenotype data from clinicians are urgently needed. This meeting begins the HVP's effort toward this important goal.

Introduction

A forum of the Human Variome Project (HVP; www.humanvariomeproject.org) was held on 6 November 2012, as a satellite to the Annual Meeting of the American Society of Human Genetics in San Francisco, California. The theme of this meeting was “Getting Ready for the Human Phenome Project.” The meeting was cosponsored by the Human Genome Variation Society (HGVS; www.hgvs.org) and was chaired and opened by Peter Robinson of Charité Universitätsmedizin Berlin. Identifying the genetic contribution to disease requires a good understanding of the phenotype, including information on the features and natural history of a disease, the spectrum of complications, and disease subclasses, among other phenotypic information needed for genomic research (taken together, the “phenome”). Problems arise when different researchers define the same disease using different parameters and terms, or use different or idiosyncratic ontologies and annotations. What is needed is a set of standards for clinical traits and phenotypic data so that information in different databases can be aggregated to create larger cohorts and so that ambiguous features of a given condition can be clarified. This will require investigators to create phenotype descriptors that include semantic standards for interoperability and reuse, technical standards for intercommunication, and a clear ethical framework for sharing data from patients and families. An HVP interest group to create such a set of standards for use by all investigators is currently forming, and all interested parties are invited to join. This meeting provided a forum to discuss the pertinent issues. The overall goal of the group is to interconnect multiple databases to provide more powerful research opportunities that help us understand the human phenome.

Phenotype Databases and Resources

The first session of the meeting, entitled “Phenotype Databases and Resources,” was chaired by Peter Robinson. In this session, there were several talks on the creation of large databases that contain both genotypic and phenotypic information for human diseases. Heidi Rehm, from the Partners HealthCare Center for Personalized Genetic Medicine and Harvard Medical School, spoke on “The International Collaboration for Clinical Genomics (ICCG): A unified system for the collection and curation of genomic variation.” The goal of the ICCG (www.iccg.org) is to create, maintain, and curate a publicly available database, in collaboration with NCBI's ClinVar database, that contains both genomic and phenotypic data [Riggs et al., 2012]. The main focus of the ICCG is to facilitate the sharing of cases and variant annotations from clinical laboratories and to connect these resources to researchers. Dr. Rehm presented the experience of the ISCA (International Standards for Cytogenomic Arrays) Consortium, the predecessor of the ICCG, with the submission and curation of copy number variant (CNV) data. Phenotypic data associated with CNVs were enhanced through the use of structured data collection tools, such as checkbox forms for physicians, and natural language processing. This effort is being extended from CNVs to sequence variation, and additional tools, including a patient registry allowing phenotype collection directly from patients, are planned.

Donna Maglott, of the National Center for Biotechnology Information (NCBI) at NIH, continued the discussion on large databases in her talk “Managing information about human phenotype at NCBI.” NCBI maintains many interconnected databases and is a central resource for all investigators interested in human disease. A recent addition is MedGen (www.ncbi.nlm.nih.gov/medgen), which organizes information about phenotypes and supports unrestricted access to terms, their definitions, and their identifiers from multiple public databases. Users can access MedGen's data interactively, as well as via NCBI's E-utilities system (www.ncbi.nlm.nih.gov/books/NBK25501). MedGen is built on the foundation of the Unified Medical Language System® (UMLS®): it maintains the terms used by different databases, the identifier each database uses for a term, and the relationships between terms, and it assigns a unique identifier to the set of terms from different databases that represent the same concept. MedGen differs from UMLS in that it integrates terms from sources outside the scope of UMLS and updates information from some sources more frequently. For example, MedGen manages terms in use by the NIH Genetic Testing Registry (www.ncbi.nlm.nih.gov/gtr). MedGen welcomes suggestions for vocabularies and ontologies to be integrated into this system, as well as for connections to be established to disease-related resources.
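
Because the E-utilities endpoints are generic across NCBI databases, a MedGen query can be scripted directly. The minimal Python sketch below searches MedGen for a free-text phenotype term; the esearch endpoint and its parameters are standard E-utilities usage, while the example term is arbitrary.

```python
# Query MedGen via NCBI E-utilities (esearch); illustrative sketch.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

def medgen_search(term, retmax=5):
    """Return MedGen concept UIDs whose records match a free-text term."""
    params = urlencode({"db": "medgen", "term": term,
                        "retmode": "json", "retmax": retmax})
    with urlopen(BASE + "esearch.fcgi?" + params) as resp:
        return json.load(resp)["esearchresult"]["idlist"]

print(medgen_search("Marfan syndrome"))
```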

The next three talks introduced different software tools that help design databases as well as utilize the information in existing databases for phenotype creation and analysis. Ada Hamosh, of the McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University, spoke on “PhenoDB: a new Web-based tool for the collection, storage and analysis of phenotypic features.” PhenoDB [phenodb.net; Hamosh et al., 2013] is a freely available, structured hierarchical database for rapid entry of phenotypic traits, allowing clinical information to be collected, stored, and analyzed for use in genome analysis. Its 2,900 features are organized hierarchically, based on OMIM clinical synopsis categories and subcategories, and are mapped to UMLS, the Human Phenotype Ontology (HPO), and Elements of Morphology.
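
To illustrate the kind of record such a tool manages (an illustrative sketch only, not PhenoDB's actual schema), a hierarchical feature entry might carry a synopsis category plus mappings to ontology identifiers; the HPO and UMLS codes below are those for hypertelorism, used here as an example.

```python
# Illustrative record structure; not PhenoDB's actual schema.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Feature:
    label: str                  # clinical feature name
    category: str               # OMIM clinical synopsis category
    subcategory: Optional[str]  # synopsis subcategory, if any
    umls_id: Optional[str]      # mapped UMLS concept
    hpo_id: Optional[str]       # mapped HPO term
    present: bool = True        # observed vs. explicitly absent

hypertelorism = Feature(
    label="Hypertelorism", category="Head and neck", subcategory="Eyes",
    umls_id="C0020534", hpo_id="HP:0000316")
```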

Carol M. Hamilton, of RTI International, Research Triangle Park, spoke on “The PhenX Toolkit: Common measures for collaborative research.” The goal of the PhenX Toolkit (www.phenxtoolkit.org) is to provide the scientific community with standard measures and protocols to use when designing or expanding any study with human subjects. The Toolkit is a catalog of 339 measures selected by content experts and is available at no cost to investigators, providing all of the information needed to integrate and implement the measures. Inclusion of standard measures facilitates the combining of study cohorts to increase sample size and statistical power for both discovery and validation studies, and thus increases the overall impact of a study. Investigators can search the database of Genotypes and Phenotypes (dbGaP) using PhenX as a filter to identify studies with variables that are identical, comparable, or related to PhenX measures.

Software tools that allow us to make sense of the vast amount of available information are also needed. Jackie MacArthur, of the European Bioinformatics Institute (EMBL-EBI), Cambridge, spoke on “GOCI: An ontology-driven search and curation infrastructure for the NHGRI GWAS Catalog.” The GWAS Ontology and Curation Infrastructure (GOCI; www.ebi.ac.uk/fgpt/gwas) is designed to improve the curation, trait organization, and querying of the NHGRI GWAS catalog [Hindorff et al., 2009]. As of November 2012, the catalog included 1,405 GWAS publications with 7,506 SNPs and 736 distinct traits. The integration of catalog traits/phenotypes into the Experimental Factor Ontology [Malone et al., 2010] has enabled the automated production of an interactive GWAS diagram. Several features have been added to enable more powerful queries, including semantic Web technologies for querying and visualizing catalog data for specific phenotypes.

Eric Zhao, of the Department of Medical Genetics at the University of British Columbia, spoke on “PhenoKeeper: User-friendly, machine readable patient phenotype data storage to support clinical genetics and genomics research.” Clinical information from medical records is a rich source of phenotypic information. The challenge is getting clinicians to provide the data in a form that allows for robust phenotype matching between centers and among patients. PhenoKeeper is an HPO-based program for detailed patient phenotype input and tracking. It is both human friendly for data entry (especially for clinicians) and machine friendly, allowing phenotypes to be described systematically and unambiguously using both clinical traits and laboratory results. Importantly, PhenoKeeper enables ranked searching for patients that match a desired phenotypic profile. The ability to store patient phenotypes in a machine-readable way will greatly enhance the capacity to identify genetic variants associated with Mendelian and complex diseases that segregate with a characteristic clinical presentation.
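
Ranked phenotype matching of this kind can be approximated with simple set similarity over HPO term sets. The sketch below is a naive Jaccard ranking, not PhenoKeeper's actual method, and the patient profiles are hypothetical; production systems typically also exploit the ontology hierarchy and term specificity.

```python
# Naive ranked matching over HPO term sets; illustrative only.
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def rank_patients(query_terms, patients):
    """patients: patient ID -> set of HPO term IDs; best match first."""
    return sorted(((pid, jaccard(query_terms, terms))
                   for pid, terms in patients.items()),
                  key=lambda kv: kv[1], reverse=True)

patients = {
    "P1": {"HP:0000316", "HP:0001263"},                # hypothetical
    "P2": {"HP:0000316", "HP:0001250", "HP:0001263"},  # profiles
}
print(rank_patients({"HP:0000316", "HP:0001250"}, patients))
# P2 ranks first (score 0.67) ahead of P1 (0.33)
```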

Computational Analysis of Phenotype

The second session, entitled “Computational Analysis of Phenotype,” was chaired by Marc Greenblatt of the Department of Medicine and the Vermont Cancer Center at the University of Vermont. In this session, a number of examples were provided of how computational techniques can help identify different disease phenotypes that can be used in genomic analysis. To begin the session, Andreas Zankl of the University of Queensland's Centre for Clinical Research presented “The ‘Skeletome’ platform, a knowledge-base for skeletal dysplasias.” Disorders of skeletal growth provide a rich assortment of disease phenotypes, with over 400 diseases associated with over 200 genes. Because most of these disorders are rare, clinical information is sparse. The Skeletome platform aims to improve this by capturing the available knowledge in the scientific community, for example, by allowing experts to collaboratively create review articles for each disorder. The platform also extracts phenotype knowledge from patient descriptions that have been submitted to its archive, and other sources of information, such as existing archives or publications, can be easily added. The backbone of the platform is the “Bone Dysplasia Ontology,” a formal representation of the skeletal dysplasia domain [Groza et al., 2012], together with the HPO. Use of these ontologies ensures that data from various sources can be integrated in a consistent fashion and allows computational analyses, such as diagnostic algorithms, to be performed across datasets. It is hoped that over time the Skeletome platform will become a universal hub for information on skeletal dysplasias. The methodology presented could also be applied to other rare diseases.

Animal models provide important insights into disease mechanisms. Damian Smedley, of the Wellcome Trust Sanger Institute, Cambridge, spoke on “Disease gene identification through phenome-wide, cross-species comparisons.” Using a semantic approach for comparing clinical and model organism phenotypes, candidate disease–gene associations can be identified. This is especially useful for diseases where no existing gene associations or potential pathways have been identified that could be used in previous gene prioritization approaches. Phenotype comparisons were used to predict 72 novel disease models from the Mouse Genome Informatics (MGI) database and 22 from the Zebrafish Model Organism Database. Validation of the first 34 of these predictions, by analyzing genetic variants in human disease samples, has now been performed: for 24%, pathogenic mutations were identified; 29% had compelling evidence for pathogenic mutations; and 32% did not have coding mutations, although the possibility remains of mutations affecting regulatory regions. Additional phenotypic information emerging from high-throughput projects such as the International Mouse Phenotyping Consortium and the Zebrafish Mutation Project will make this approach even more powerful for the identification of disease genes using animal model systems.
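
Cross-species comparison of this kind depends on mapping species-specific phenotype terms into a shared space and scoring profile overlap. The following toy sketch is not the speakers' actual algorithm: the term-to-class mapping and the information-content weights are hypothetical stand-ins for curated HPO-to-MP correspondences.

```python
# Toy cross-species phenotype matcher; mapping and weights are
# hypothetical stand-ins for curated HPO<->MP correspondences.
def profile_score(human_terms, mouse_terms, to_class, ic):
    """to_class: species-specific term -> shared phenotype class;
    ic: shared class -> information-content weight (rarer = higher)."""
    mouse_classes = {to_class[t] for t in mouse_terms if t in to_class}
    matched = [ic.get(to_class[t], 0.0)
               for t in human_terms
               if to_class.get(t) in mouse_classes]
    return sum(matched) / max(len(human_terms), 1)

to_class = {"HP:0001250": "seizure", "MP:0002064": "seizure"}
ic = {"seizure": 4.2}
print(profile_score({"HP:0001250"}, {"MP:0002064"}, to_class, ic))  # 4.2
```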

Another example of how powerful a formal phenotypic analysis can be was provided by Paul Schofield of the Department of Physiology, Development and Neuroscience at the University of Cambridge, who presented “An informatic analysis of human overgrowth disorders.” Mutations in a disparate range of genes give rise to the closely related phenotypes of the overgrowth disorders, which include Sotos syndrome, Weaver syndrome, Beckwith–Wiedemann syndrome, and Simpson–Golabi–Behmel syndrome. A set of 53 human somatic overgrowth syndromes was phenotypically annotated with the HPO using online sources, OMIM and Orphanet, and the published literature. Only half of the syndromes are associated with specific genes and mutations, so both gene-based and phenotype-based analyses were undertaken. Using Gene Ontology annotations for the known genes, overrepresentation of gene functions and pathways was examined with Ontologizer and Ingenuity Pathway Analysis, and a number of pathways associated with overgrowth disorders were identified. Deregulation of both the PI3 kinase and MAPK pathways was found to be important in the control of growth; intriguingly, negative regulation within these pathways appears to underlie the overgrowth phenotypes. Clustering analysis based on phenotype alone demonstrated the existence of phenotype clusters that overlapped closely with the gene networks discovered using overrepresentation analysis, supporting the existence of modular phenotypes in the somatic overgrowth disorders that represent the disruption of the same gene network.

Berit Kerner, of the Semel Institute for Neuroscience and Human Behavior at the University of California, Los Angeles, spoke on “The challenges of phenomics in psychiatry.” Over fifty million individuals worldwide suffer from severe mental illness, and about 50% of them will attempt suicide in their lifetime, creating an enormous societal burden. Because the diagnostic boundaries of psychiatric disorders are not well defined, research on genotype–phenotype correlations remains challenging despite strong evidence for heritability in twin studies. To demonstrate these phenotype complexities, bipolar disorder was chosen as an example. New initiatives at the NIMH now aim to tackle these challenges: (1) through the Center for Collaborative Genomic Studies on Mental Disorders (www.nimhgenetics.org/access_data_biomaterial.php), phenotype data and biomaterial from thousands of individuals diagnosed with psychiatric disorders have been collected over the past 20 years, providing valuable resources for the research community worldwide; (2) the Research Domain Criteria Project (www.nimh.nih.gov/research-funding/rdoc.shtml) explores new ways of classifying psychopathology based on dimensions of observable behavior and neurobiological measures, in an attempt to get closer to the pathophysiology of psychiatric disorders; and (3) the Human Connectome Project (www.humanconnectome.org) focuses on individual differences in brain circuitry and their relationship to behavior, as well as genetic and environmental risk factors. A more precise definition of individual differences in disease phenotypes, together with genome-wide sequencing efforts, will provide new insight into the pathophysiology of mental disorders and eventually lead to more effective treatment.

Tudor Groza, of the School of Information Technology and Electrical Engineering at the University of Queensland, spoke on the “Challenges and lessons learned from decomposing skeletal phenotype descriptions.” Over the last few years, a significant amount of research has been performed on ontology-based formalization of phenotypes, with the HPO pioneering this effort. To develop a deeper understanding of skeletal phenotypes in particular, an automated processing pipeline has been constructed that enables the logical decomposition of phenotype descriptions into their intrinsic quality/entity associations using Foundational Model of Anatomy (FMA) and Phenotype and Trait Ontology (PATO) concepts. Applying this pipeline in an experimental study on the HPO “Abnormalities of the skeletal system” entities revealed several findings, such as differences in terminological structure between HPO and FMA, and missing FMA and PATO concepts. These findings have led to a series of open challenges, including the representation and testing of computational definitions for complex phenotype classes, and the definition and representation of abnormalities that involve relationships between anatomical elements and of abnormalities of spatial, functional, and nonfunctional properties of anatomical elements (e.g., mineral density, movement, and angles).
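
Such an entity/quality (EQ) decomposition can be represented as a simple record linking an HPO class to an anatomical entity and a quality. The sketch below is purely illustrative: the labels stand in for FMA and PATO identifiers and are not output of the actual pipeline, and the "missing concept" check merely mimics the kind of finding described above.

```python
# Illustrative EQ records; labels stand in for FMA/PATO identifiers.
from typing import NamedTuple

class EQ(NamedTuple):
    hpo_label: str   # source HPO class label
    entity: str      # anatomical entity (FMA concept)
    quality: str     # quality (PATO concept)

records = [
    EQ("Broad femoral neck", entity="femoral neck (FMA)",
       quality="increased width (PATO)"),
    EQ("Thin ribs", entity="rib (FMA)",
       quality="decreased thickness (PATO)"),
]

# Flag records whose entity has no FMA concept (simulated lookup set),
# mimicking the "missing concepts" finding reported by the pipeline.
known_fma = {"rib (FMA)"}
missing = [r.hpo_label for r in records if r.entity not in known_fma]
print(missing)  # ['Broad femoral neck']
```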

Most association studies of genetic variation have focused on single nucleotide variants, but CNVs are also an important source of functional genetic variation. On this topic, Sandra Doelken, of the Institute of Medical Genetics and Human Genetics at Charité Universitätsmedizin Berlin, spoke on “Phenotypic overlap in the contribution of individual genes to CNV pathogenicity.” Although CNVs are thought to play an important role in genetic disease, determining the effect of a given CNV on the phenotype is difficult. In some cases, CNV pathogenicity is thought to be related to altered dosage of the genes within the CNV. Interspecies comparisons can play a role in determining the effects of CNVs in these situations. To do this computationally, one needs ontologies containing logical definitions of phenotypes for the different species, along with automated reasoning and integration of data [Doelken et al., 2013]. There are currently 1,843 human genes with HPO-coded phenotypes, 6,535 mouse genes with MGI/MP-coded phenotypes, and 1,625 zebrafish genes with coded phenotypes. This method was applied to 27 CNV-associated disorders and identified 802 gene–phenotype associations, 431 of which were identified based solely on model organism phenotype data. Future goals include expanding structured data representation of human disease manifestations, further integrating model organism data, and developing additional algorithms for analysis.
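
One way to frame the gene-dosage analysis computationally (a minimal sketch with hypothetical coordinates and annotations, not the authors' implementation) is to collect the genes a CNV overlaps and intersect their known phenotype annotations with the patient's phenotype profile:

```python
# Hypothetical data; illustrates the shape of the analysis only.
def genes_in_cnv(cnv, genes):
    """cnv: (chrom, start, end); genes: name -> (chrom, start, end)."""
    c, s, e = cnv
    return [g for g, (gc, gs, ge) in genes.items()
            if gc == c and gs < e and ge > s]

def candidate_drivers(cnv, genes, gene2pheno, patient_pheno):
    """Genes inside the CNV whose known phenotypes overlap the patient's."""
    return {g: gene2pheno.get(g, set()) & patient_pheno
            for g in genes_in_cnv(cnv, genes)
            if gene2pheno.get(g, set()) & patient_pheno}

genes = {"GENE_A": ("chr1", 100, 200), "GENE_B": ("chr1", 300, 400)}
gene2pheno = {"GENE_A": {"HP:0001250"}, "GENE_B": {"HP:0000316"}}
print(candidate_drivers(("chr1", 50, 350), genes, gene2pheno,
                        {"HP:0001250"}))  # {'GENE_A': {'HP:0001250'}}
```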

Phenotype Resources and Tools

The final session, entitled “Phenotype Resources and Tools,” was chaired by Richard Cotton of the Human Variome Project, University of Melbourne. Generating new knowledge from rich database resources in which phenotypic information is coupled with genetic variants requires sophisticated analytical tools, and many types of evidence, including statistical evidence, need to be integrated to generate new findings. One way to improve statistical power is to combine cohorts with similar phenotypes, but doing so raises problems: data standards need to be developed and implemented to facilitate the system interoperability that is crucial for fully integrated analysis. To this end, Tim Beck of the Department of Genetics, University of Leicester, spoke on “Strategies, standards and databases devised by the GEN2PHEN project for effective exchange and use of phenotype and related information.” A goal of GEN2PHEN (www.gen2phen.org) is to unify human and model organism genetic variation databases by providing holistic views of genotype-to-phenotype data. A number of databases and resources that can be used to combine genotype and phenotype information were presented in this talk. GWAS Central (www.gwascentral.org) is a comprehensive database for the comparative interrogation of summary-level genome-wide association study data. The ability to search via HUGO Gene Nomenclature Committee gene symbols, dbSNP reference SNP ID numbers, and phenotypes (diseases, traits, medical signs, and symptoms) annotated with Medical Subject Headings and HPO terms ensures that a rich source of standardized genotype and phenotype data is placed at the fingertips of researchers. In addition, GEN2PHEN is coordinating the I4Health initiative (www.i4health.eu), which promotes the intelligent integration of research and clinical information to successfully provide individualized healthcare. One of the many natural consequences of this integration work will be to make larger cohorts from multiple studies available for analysis.

The use of animal models to understand Mendelian and complex disorders is critical to unraveling the molecular and biochemical basis of these diseases, and a database that integrates such information for animal models of human disease is invaluable for discovery. Cynthia Smith of the Jackson Laboratory, Bar Harbor, spoke on “The Mammalian Phenotype Ontology: a tool for annotating complex phenotypes.” The Mammalian Phenotype Ontology (MP), a structured vocabulary of phenotype terms, is the cornerstone for semantic consistency of phenotype descriptions arising from multiple resources, including the published literature and large-scale phenotyping pipelines. The MP includes nearly 9,000 terms and continues to expand as needed by curators annotating new data, by mutagenesis projects designing new phenotypic tests, and through systematic review by domain experts. Structured phenotype data at MGI (www.informatics.jax.org) include all mutation types, from those of spontaneous origin to chemically induced, targeted, and gene-trapped mutations, among others. In addition, associations to OMIM diseases are made when a mouse model is reported. Alignment of mouse and human phenotypes will aid in identifying new mouse genetic models and human disease genes for further study.

John Carey, from the University of Utah School of Medicine, provided additional insight in his talk “Elements of Morphology: Standard terminology and definitions of phenotypic variations.” Interoperability between databases requires a precise definition of phenotypes, which depends on a clear consensus on phenotypic features. The Elements of Morphology Working Group was established in 2004 to meet this need (elementsofmorphology.nih.gov). To begin the process, 34 clinical geneticists met at two formal working meetings with the goal of creating a standard set of terms for the different phenotypes of the head and neck; they established over 400 standard definitions for the morphology of the face and skull. The results of these discussions were summarized in six articles published in the January 2009 issue of the American Journal of Medical Genetics Part A [Allanson et al., 2009]. As a result of this process, an instrument was developed for describing and defining phenotypes associated with dysmorphology, for use by researchers and clinicians. By using set terms, comparisons of phenotypes between individuals become much more meaningful and easier to accomplish. As this process expands, it will serve as a model for all phenotypes.

Central to the formation of databases documenting phenotypic and genetic information are the patients themselves. Beverly Searle, of UK-based Unique, a rare chromosome disorder support group (www.rarechromo.org), described how, since 1984, families have demonstrated their willingness to submit detailed phenotypes of affected relatives to Unique's confidential database. With the recent development of professional initiatives to centralize the global collection of very large data sets, professionals must heed patients’ views if these initiatives are to be successful. Patients recognize the huge benefits that might arise from them, not least the stimulus to research and improved treatments and therapies. However, with deeper phenotyping linked to more detailed genetic information, patients have voiced concerns: there needs to be a sensible balance between the rights and choices of the individual and the public benefit; the extent of access to data must be carefully considered; data security and privacy, and clear, properly informed consent processes, are essential; and differing views about how the data are used should be respected. It is critical that professionals work with patients and patients’ organizations to “get it right.”

A major source of phenotypic data is clinicians. Marta Girdea, from the University of Toronto, Ontario, spoke for the Finding of Rare Disease Genes in Canada (FORGE Canada) project on ways to ensure that accurate phenotyping data are captured in her talk “Easy phenotyping for clinical and research use.” Precise phenotyping can bring great benefits to both clinical practice and medical research, especially in the case of rare disorders. The best time to obtain phenotypic traits from a patient is during a clinical exam, and there are usually two alternatives for entering phenotypic information: free-form notes or predefined terms. Free-form notes can result in the same trait being described in multiple ways, including typos, which makes comparison between individuals, either in the same cohort or between cohorts, problematic. Predefined terms solve this ambiguity but may not be flexible enough for describing new traits. A new clinical exam and diagnosis assistant (PhenoTips) is being created to make it easier to describe clinical phenotypes. The computer-based assistant helps the clinician through the multiple steps of creating a phenotype description of a rare disorder. The application features an error-tolerant search engine for phenotypic descriptions and also allows for recording relevant phenotypes that were not observed, which can be helpful for differential diagnosis. Other functionality includes the recording of standard measurements with automatic plotting of growth curves, automatic selection of phenotypes based on abnormal measurements, and instant searches in OMIM for disorders that match the selected phenotypic description. A demo version is now available (phenotips.cs.toronto.edu).
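
Error-tolerant term lookup of the kind described can be approximated in a few lines; the sketch below is not PhenoTips' actual engine, but it shows how free text can be fuzzy-matched against a handful of HPO labels with Python's standard difflib.

```python
# Minimal error-tolerant lookup; not PhenoTips' actual search engine.
import difflib

HPO_LABELS = {  # tiny sample of HPO labels and IDs
    "Hypertelorism": "HP:0000316",
    "Seizure": "HP:0001250",
    "Global developmental delay": "HP:0001263",
}

def fuzzy_lookup(text, n=3, cutoff=0.6):
    """Return (label, HPO ID) pairs whose labels resemble the input."""
    matches = difflib.get_close_matches(text, HPO_LABELS, n=n, cutoff=cutoff)
    return [(m, HPO_LABELS[m]) for m in matches]

print(fuzzy_lookup("Hyperterlorism"))  # tolerates the typo
# [('Hypertelorism', 'HP:0000316')]
```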

The formation of a database for hereditary nonpolyposis colon cancer (HNPCC, Lynch syndrome) in Denmark has been ongoing for over 20 years. Inge Bernstein, of the Hvidovre University Hospital, Copenhagen, spoke on this database in her talk “The Danish HNPCC-system: A multidisciplinary effort to support individual healthcare in HNPCC (Hereditary Non-polyposis Colon Cancer).” The aim of this database is to improve the prognosis of HNPCC and colorectal cancer (CRC) by allowing the identification of persons at risk through family risk determinations and mutation identification. The database includes dropdown menu options to ensure that standard terminology is used for all patients, and it provides reporters with feedback on risk estimates. To date, 5,211 families have been identified. Once a mutation is identified in an individual, or a family's phenotype is evaluated as high risk, letters are sent to unaffected family members suggesting that they be tested for mutations predisposing them to CRC or enrolled in colon screening. Over 3,400 reports have so far been sent electronically. Early identification and screening of individuals predisposed to CRC should decrease the mortality associated with this disease.

Barend Mons, of the Leiden University Medical Center and Netherlands Bioinformatics Center, The Netherlands, closed the meeting with his talk “Data integration via RDF/semantic web.” The move toward larger research cohorts, and correspondingly larger databases, is making access to the raw data by other investigators very difficult. Datasets are becoming too massive to be published in the traditional way: a narrative description of a database in a publication does not provide enough information, and even supplementary data usually do not suffice to recreate the original dataset. To compound the problem, links to published databases are often broken. A better way to store and share the data used in research is needed. One proposed solution is the concept of nanopublications (nanopub.org). A nanopublication is the smallest unit of publishable information, containing an assertion and its provenance. The goal of nanopublications is to incentivize the placement of data that would otherwise be impossible to formally publish in a journal into the public domain, while linking the data to their contributor. This would include human genome variants and SNPs associated with a specific phenotype. By providing an opportunity for investigators to be acknowledged for experimental findings, small (nano) bits of information, such as single mutations associated with a disease, can be made available to researchers and clinicians. This will hopefully make a much richer source of human genetic variation available to investigators for further analysis.
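
The assertion-plus-provenance structure is straightforward to express in RDF. Below is a minimal sketch using Python's rdflib; the example.org namespace, variant, and phenotype IRIs are illustrative, not part of an official nanopublication schema, and real nanopublications separate the parts into named graphs.

```python
# Minimal RDF sketch of an assertion plus its provenance (real
# nanopublications use separate named graphs; one graph is used here
# for brevity). All example.org IRIs are illustrative.
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/")
PROV = Namespace("http://www.w3.org/ns/prov#")

g = Graph()
g.bind("ex", EX)
g.bind("prov", PROV)

# Assertion: a (hypothetical) variant is associated with a phenotype.
g.add((EX.variant123, EX.associatedWith, EX.HP_0001250))

# Provenance: who contributed the assertion, and when.
g.add((EX.assertion1, PROV.wasAttributedTo, EX.researcherA))
g.add((EX.assertion1, PROV.generatedAtTime, Literal("2012-11-06")))

print(g.serialize(format="turtle"))
```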

Where to Go from Here?

Great strides have been made in DNA sequencing. Soon, whole genome sequencing will be possible even in small research laboratories, providing access to almost all the genetic variation within an individual, although there are still many challenges associated with data storage, analysis, and interpretation. To make the best use of this rich source of genetic information, access to accurate, detailed clinical information is needed to construct robust phenotypes for association studies. Furthermore, phenotypes need to be defined identically across different cohorts when they are combined or used as validation cohorts. This meeting highlighted several important challenges, and potential solutions, for improving phenotyping and enhancing genetic research and healthcare. Some of these are core problems that apply to all fields, and some are disease specific. Standardization of terms and interoperability of systems are critical. For example, the HPO was used by many speakers and presented as one such standard, and it can perhaps serve as a template for the formation of other standardized systems.

The HVP is dedicated to creating interoperable databases of clinical information throughout the world to allow the use of multiple cohorts in a single analysis. To this end, the HVP is helping to promote the building of the database infrastructure that allows for this interoperability. An additional goal is to have clinicians code this information into databases in a way that maximizes the interoperability of phenotypic information between investigative studies throughout the world. Barriers to extracting clinical data for systematic analysis include issues of consent, ethics (especially individual identifiability in rare diseases), the ease of data collection, and the pressures of time and resources on clinicians and users of automated data collection systems. The HVP is dedicated to bringing solutions to these problems so that genetic variation associated with common disease phenotypes can be identified, and so that the therapies needed to reduce the suffering associated with these diseases can be created and individualized. Upcoming meetings of the HVP phenotype interest group, in conjunction with the International Rare Diseases Research Consortium, Orphanet, and the European Society of Human Genetics, are well positioned to advance these goals.

It seems fair to say that this community has recognized the importance of making phenotypic data resources interoperable in order to address the challenges and exploit the possibilities of new technologies in DNA sequencing. In particular, there would seem to be broad agreement that we need to improve the computational representation of the human phenotype to (1) accelerate the discovery of the remaining Mendelian disease genes, (2) improve diagnostics in human genetics, and (3) prepare the way for the development of “a New Taxonomy of human disease based on molecular biology” that has been proposed as one of the pillars of the “precision medicine” of the future.

The key issues for the coming years include improving interoperability between the many domains of human medicine and biomedical research. For instance, SNOMED CT (Systematized Nomenclature of Medicine Clinical Terms) is a comprehensive ontology of medical terms, providing codes, terms, synonyms, and definitions covering diseases, findings, procedures, and many other entities, that is intended for use in settings such as electronic health records (EHRs) but is not especially designed for scientific research in any one field of human disease. On the other hand, many of the resources discussed at this meeting, including the Human Phenotype Ontology, are linked to resources, such as genetic and model-organism data, that are important for research but are not intended for use in hospital IT systems. The community has begun efforts to increase interoperability among resources for the human phenotype by agreeing upon a core set of definitions and terms and by supplying cross-references between different terminologies and ontologies.

Another major challenge to achieving interoperability is technical in nature. Centers all over the world are now performing whole-exome sequencing on patients with rare diseases, but often any one center will be caring for only one patient or family with a given rare disease. At the same time, filtering exome data for genes with rare, potentially pathogenic sequence variants generally yields tens or even hundreds of candidate genes. Since the identification of multiple unrelated persons with the same clinical syndrome and a mutation in the same gene is a prerequisite for the certain identification of a novel disease gene, it is necessary to play a sort of matching game and connect researchers who may be studying the same syndrome, in order to find the second and third patients with a mutation in a novel disease gene. To date, however, there is no agreed-upon computational exchange standard, analogous to the MIAME standard for microarray data, that would allow this. A number of groups are working on the problem, and a meeting is planned directly after the International Rare Diseases Research Consortium (IRDiRC) Conference in Dublin in April 2013 to work on these issues.
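
A back-of-the-envelope version of this matching game can be expressed in a few lines (the center names and gene lists below are hypothetical, and no particular exchange format is implied): collect each center's candidate genes for patients with a similar syndrome, then flag the genes hit independently at two or more centers.

```python
# Hypothetical candidate-gene lists from three centers; no exchange
# standard or real gene names are implied.
from collections import Counter
from itertools import chain

center_candidates = {
    "centerA": {"GENE1", "GENE2", "GENE3"},
    "centerB": {"GENE2", "GENE4"},
    "centerC": {"GENE1", "GENE2", "GENE5"},
}

counts = Counter(chain.from_iterable(center_candidates.values()))
recurrent = sorted(g for g, n in counts.items() if n >= 2)
print(recurrent)  # ['GENE1', 'GENE2']: hit at two or more centers
```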

Finally, all progress must be based on a deep understanding of the ethical and privacy issues involved. An exome sequence can be compared to a molecular fingerprint that makes an individual uniquely identifiable, and thus databases with exome and phenotype data could easily expose study subjects to discrimination unless appropriate safeguards are in place. Some novel solutions have already been developed. For example, Cafe Variome [Beck et al., 2012] provides a flexible portal to announce, discover, and then automate the tailored sharing of mutations to support particular research and clinical groups, with only core, nonidentifiable data being submitted to the central repository. Databases and technologies for phenotype data can be, and will need to be, co-managed by such systems in order to allow maximum progress in our understanding of rare diseases. This will greatly improve our ability to treat affected patients while simultaneously respecting the rights and privacy of these individuals.

Acknowledgments

The 2012 Meeting of the Human Variome Project was chaired by Peter Robinson from Charité Universitätsmedizin Berlin. The sessions were chaired by Peter Robinson, Marc Greenblatt, and Richard Cotton.
