Disease-specific databases: Why we need them and some recommendations from the Human Variome Project Meeting, May 28, 2011


  • How to Cite this Article: Howard HJ, Beaudet A, Gil-da-Silva Lopes V, Lyne M, Suthers G, Van den Akker P, Wertheim-Tysarowska K, Willems P, Macrae F. 2012. Disease-specific databases: Why we need them and some recommendations from the Human Variome Project Meeting, May 28, 2011. Am J Med Genet Part A 158A: 2763–2766.


Locus-Specific Databases (LSDBs), with disease-specific experts and curators, are an essential ingredient in enabling the benefits of advances in sequencing and mutational analysis to be realized across the genome. Next generation sequencing provides both astounding opportunities and challenges, especially for genetic counsellors. An approach coordinated at a genome-wide, international level and supported by well-organized, respected disease-specific organizations is the model most likely to succeed, but committed, resourceful professionals working in poorly resourced local environments can also make valuable contributions that can grow. Bioinformatic tools to sift and integrate multiple domains of information are being developed, and play a major part in meeting the challenges. Regulation of providers, including a requirement for them to submit mutational information to central databases, should also help to reach the goals needed to realize the opportunities. There is also a need to agree on governance of LSDBs at an international level, and for adequate international funding to support this need, to ensure humanity reaps the benefits of the current molecular genetic revolution. The Human Variome Project offers this, working also with the other major initiatives with similar objectives. This report concludes with Recommendations for the Human Variome Project stemming from the presentations and discussions at the meeting. © 2012 Wiley Periodicals, Inc.


Sharing of information relating to variants across the genome is an ever increasing necessity. The best use of genomic data from sequencing and other technologies can only come with the collective experience and wisdom of the scientific community as a whole. This requires a dedicated effort to align all disciplines and investigators who can contribute. The vision of the Human Variome Project is to document variation in all genes across all diseases in all countries and cultures. The foundations of this approach have been established, and meetings such as this are important to set the agenda and gather the confidence and interest of the scientific community to reach this humanitarian goal.

The lead approach is to establish Disease-Specific Databases containing genotype and phenotype information relating to one or, sometimes, multiple genes which may alone or together predispose to disease. In the main, these databases to date deal with single disorders. But do we really need them?

This question was addressed at a satellite meeting of the European Society of Human Genetics meeting held in Amsterdam, Netherlands, on May 28, 2011.


Arthur Beaudet began by providing a clear picture of a future of genome-wide sequencing and carrier testing, with all the challenges that this data explosion brings. Already it is evident that some Direct-to-Consumer tests are not well quality controlled, claiming sensitivities and specificities that are either misleading or likely to be inaccurate. He noted that this has attracted the attention of the FDA, and rightly so. Some DNA laboratories do provide good service backed by conservative interpretation, but others recently have been less cautious.

Stephen Kingsmore has published his experience offering screening for 489 mostly recessive but severe disorders using hybrid capture and next generation sequencing. The application of this technology, whether for selective mating, pre-conception parental testing, pre-implantation genetic diagnosis, maternal plasma screening for fetal DNA (now being reported), or prenatal testing of the fetus, needs careful community and parental thought in the context of the severity and treatability of the genetic predisposition targeted. The clinical decisions around such offers also need careful thought in the context of local cultural norms and preferences. It is thought that each of us will carry at least three disease-carrying mutations. However, to interpret DNA sequencing information, we need comprehensive sequencing data on unaffected individuals to provide information on normal ethnic-specific genetic variation at different loci.

These innovations show clearly the need for ethnically stratified genotype and phenotype data to be available in LSDBs to enable interpretation of the avalanche of information that confronts us. But will our healthcare systems provide the funding? Peter Taschner pointed out that the cost of such screening falls far below the cost of lifetime support for a person affected by a severe genetic condition.


Vera Gil-da-Silva-Lopes presented the Brazilian efforts to establish an LSDB on orofacial clefts, describing the difficulties of working within a region where genetic education and services are poorly resourced and widely dispersed. Her project, a non-government, nation-wide, hospital-based, prospective, voluntary initiative, is a collaborative model for LSDBs in developing countries. She reported useful genetic epidemiological data which form the basis for monitoring changes over time and possible environmental exposures, as well as capturing relevant aspects of healthcare delivery.


Peter van den Akker presented the International Dystrophic Epidermolysis Bullosa Patient Registry, an online database of patient data with their COL7A1 mutations (http://www.deb-central.org), and Katarzyna Wertheim-Tysarowska presented a similar database (http://www.col7.info) with user-friendly features. Most families have private mutations, making the task of LSDBs highly relevant. A merge of these excellent initiatives is planned and will be advantageous to both groups and their patients. The tools now available to construct the models behind LSDBs are becoming sophisticated. Peter van den Akker explained the utility of the open-source MOLGENIS software (http://www.molgenis.org/wiki/MolgenisDownload), which is also being utilized by Gen2Phen.


The presentation by Finlay Macrae described the 2008 merge of databases to form the current International Society for Gastrointestinal Hereditary Tumours (InSiGHT) mismatch repair gene database: the original ICG HNPCC mismatch repair gene database, and two other databases focussing on the published literature around mutations in the mismatch repair genes (M. Woods) and functional assays of mismatch repair variants (R. Sijmons). Since this time a number of significant milestones have been achieved by InSiGHT:

  • Formal collaboration with the Human Variome Project in 2007.

  • German HNPCC consortium data uploaded.

  • Diagnostic lab DNA variants submitted.

  • Continuing work on calibration of functional assays of missense variants (R. Hofstra).

  • In silico analyses (S. Tavtigian).

  • Quantitative phenotype dataset drafted.

  • Interpretation processes implemented.

  • Appointment of the Hicks Foundation InSiGHT full time curator.

There are now over 25,000 variant submissions on the database, of which nearly 5,000 are unique. The website attracts over 20,000 page hits per month.

The InSiGHT database serves as a model for the Human Variome Project. It provides a guide to how a consortium can be formed around a disease and how its governance, scientific pursuits, and database can be constructed. A strong international, professional, and democratic consortium (as is InSiGHT) backing the LSDB has been identified as a key element in the success of such an international venture. InSiGHT has chosen to incorporate, which allows it to enter into Memoranda of Understanding and Data Transfer Agreements where requested and offers it some protection from the legal consequences of adverse health outcomes arising from misclassification of variants on its databases. In this process, registration as a charity in the UK has been granted, allowing tax deductibility of donations. The importance of the LSDB curator's role was emphasized: validating existing and new data; checking the integrity of submissions, nomenclature, and duplicate entries; promoting submissions internationally; and assembling published literature relating to variants.

InSiGHT has developed an Interpretation Committee to classify variants of uncertain significance through a transparent process with pre-defined criteria for classification into the IARC 5 class system. This committee now has 45 members from around the globe, allowing the responsibility to be shared. Variants to be discussed at its bimonthly meetings are advertised to InSiGHT members, with a request that any unpublished information not already submitted to the database be forwarded to the curator, who then assembles all information for the committee to discuss and classify the variants. The classification is then entered as a one-line entry on the database, with the supporting reasons recorded as an annotation.
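As an illustration of what a five-class scheme with pre-defined criteria can look like in practice, the following minimal sketch maps a posterior probability of pathogenicity to a class number. The probability cut-offs are the ones commonly cited for the IARC scheme and are stated here as an assumption; the report does not give them, and InSiGHT's own criteria combine several lines of evidence that are not reproduced here. The function name `iarc_class` is hypothetical.

```python
def iarc_class(posterior: float) -> int:
    """Map a posterior probability of pathogenicity to a class from 1 to 5.

    Cut-offs are assumed from the commonly cited IARC scheme; they are
    not taken from the meeting report itself.
    """
    if not 0.0 <= posterior <= 1.0:
        raise ValueError("posterior must lie in [0, 1]")
    if posterior > 0.99:
        return 5  # definitely pathogenic
    if posterior >= 0.95:
        return 4  # likely pathogenic
    if posterior >= 0.05:
        return 3  # uncertain significance
    if posterior >= 0.001:
        return 2  # likely not pathogenic
    return 1      # not pathogenic


if __name__ == "__main__":
    for p in (0.995, 0.97, 0.5, 0.01, 0.0005):
        print(p, "-> class", iarc_class(p))
```

A committee process such as the one described would sit in front of a function like this, deciding which evidence is admissible and how the probability is derived.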


Patrick Willems presented a different, federated way to collect data: MutaDatabase is a standardized, centralized, open access database of variants leading to human genetic disease.

He reflected that the problems we face now are:

  • Novel DNA variants are not being made public in scientific publications or public databases.

  • Many remain unclassified variants (VUS), which is a challenge for genetic counselling.

Willems estimated that only about 10% of all variation information is being submitted, published, or shared. His vision is an automated submission process to a single database that holds the information of all genes/diseases. MutaReporter is a software tool that will be used by the labs to submit, curate and share data.

This data will then be shared with MutaDatabase which will provide:

  • General information on human disease genes.

  • Overviews of all variants in these genes.

  • Overviews of diseases associated with these genes.

  • Tables of:

      • cDNA sequence

      • Amino acid sequence

      • Exon–intron sequences

      • Disease variants

  • Figures of:

      • Cytogenetic localization

      • Physical map

      • Genomic structure

      • Variants

  • Easy submission of molecular and clinical information, the latter using PhenoExplorer software with tick boxes to record features.

Associated software concepts (Mutacircles, Mutareviews, and MutaReporter) are tools that will provide a way to communicate between interested individuals and the wider community. The database would be supported by access to HGMD and dbSNP.

A common theme across many of the talks was the financial sustainability of the databases. The model that Patrick Willems has in place is an open variant database combined with a paid license for MutaReporter, the reporting and interrogation tool. Curators, who need to use MutaReporter for data submission, would be provided with free licenses. The market rate otherwise is projected to be 1000 USD (700 Euro).


Mike Lyne presented MetabolicMine, an integrated web resource of data and tools which supports the wider metabolic disease research community. This free search tool is based on the successful “InterMine” database platform, which integrates genomic data from fruit fly (flyMine.org) and rodent (ratMine.org). MetabolicMine covers the genomics, genetics, and proteomics of common metabolic diseases, integrating data sets on genes, proteins, interactions, orthologs, pathways, ontologies, diseases, GWAS, and SNPs, and providing the tools for their exploration. MetabolicMine's underlying technology can reach information across a range of organisms, including zebrafish, yeast, and rat, avoiding the need to visit several sites and formats. It can:

  • Examine properties of a collection, for example, genes to pathways, SNPs in GWAS experiments.

  • Investigate gene function.

  • Evaluate individual sequence variations.

  • Conduct genome region searches, for example, find genes or features in a chromosome region.

  • Work with old/diverse identifiers.

  • Export in a range of formats.

It has particular utility for researchers wanting to investigate the features of regions identified as hits in GWAS studies.
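The genome region search listed above can be pictured as an interval-overlap query. The sketch below is a plain Python illustration of that idea and not MetabolicMine's or InterMine's actual interface; the `Feature` class, the `features_in_region` function, and the gene names and coordinates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    name: str
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def features_in_region(features, chrom, start, end):
    """Return the features that overlap chrom:start-end (inclusive)."""
    return [
        f for f in features
        if f.chrom == chrom and f.start <= end and f.end >= start
    ]


# Hypothetical features; names and coordinates are invented for illustration.
FEATURES = [
    Feature("geneA", "1", 100, 500),
    Feature("geneB", "1", 900, 1500),
    Feature("geneC", "2", 100, 500),
]

if __name__ == "__main__":
    hits = features_in_region(FEATURES, "1", 400, 1000)
    print([f.name for f in hits])  # prints ['geneA', 'geneB']
```

For a GWAS hit, the query region would typically be a window around the associated SNP, and the returned features are the candidate genes to inspect.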


All the above initiatives are dependent on one thing: submission of data. But how do we get laboratories to submit their information? This problem was a key theme throughout all the presentations.

Graeme Suthers addressed this in his presentation “Carrots and sticks—what are the best options?” Can we make submission a requirement for Laboratory accreditation (the stick approach) and move the language in national accreditation standards from “labs should submit” to “labs must submit” accurate genotype and phenotype data to central databases? Carrots would include robust software design, the development of international standards and certification, and a ranking system for LSDB's based on the value added, that is, consistent interpretations and accuracy. Incentives for phenotypic data collection require more work globally as this could be the most problematic. Disease consortia (such as InSiGHT) provide the best approach to support this. Patient consent was another issue that could impede submission of data. If analyses of genotypic and phenotypic data are essential for producing an accurate test result, that is, essential for patient wellbeing, then it could be considered unethical not to submit the data. On the other hand, if the data are to be used for research, that is, not essential for patient wellbeing, then it would require patient consent.

Suthers also pointed out the need to educate and even accredit the community using LSDBs to achieve maximum use. He covered nomenclature, data accuracy (both genotypic and phenotypic), challenges around updating of databases, ascertainment biases, transparent interpretation processes, reporting standards, and privacy issues, including those around the bi-directional information flow needed to inform patients represented in databases of insights gathered from new data.




  • (1) The HVP should identify all inherited disease related organizations, establish whether they have LSDBs, and invite them to collaborate and contribute to the HVP disease-specific database Council. The formation of disease-specific interest groups is a tried and true model for establishing LSDBs. Whether this is through dedicated professionals working in developing countries with no government support, as so ably demonstrated by the Brazilian orofacial cleft LSDB, or through international organizational efforts such as InSiGHT, all approaches should be encouraged.
  • (2) Where two or more disease or gene-specific databases exist, they should be encouraged to merge, or at least to cross-reference each other at the variant level.
  • (3) The Human Variome Project should establish guidelines for disease-specific database content and establish an accreditation system, recognizing the quality of the databases against Human Variome Project standards.
  • (4) The Human Variome Project should develop a catalogue of meta-analysis tools such as are emerging in Gen2Phen, MutaReporter, MetabolicMine, and ClinVar at the NCBI.
  • (5) The Human Variome Project needs to assist in developing and implementing tools for phenotype capture such as PhenExplorer and the Human Phenotype Ontology, testing their application to LSDBs.
  • (6) The Human Variome Project should develop training for users of databases.

The Human Variome Project should monitor developments in ethical approval of data submission to LSDBs, as centralization of data can be a block to the process of data submission for some, perhaps many, laboratories. Distinctions between clinical use and research use should be made clear in the data use policies of databases.

The HVP should act as a clearing house for acceptable data use policies. The Creative Commons website can assist this process.
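Cross-referencing at the variant level, as proposed in recommendation (2), can be sketched as building a shared index keyed by gene and variant name. The example below is a minimal illustration under stated assumptions: records are plain dictionaries with hypothetical field names (`id`, `gene`, `hgvs_c`), both databases describe variants against the same reference transcript, and the registry names and variant values are invented placeholders, not real entries.

```python
from collections import defaultdict


def cross_reference(*databases):
    """Index records from several databases by (gene, cDNA variant name).

    Each database is a (name, records) pair. Keys are normalised only
    lightly (upper-case gene symbol, whitespace stripped from the variant).
    """
    index = defaultdict(list)
    for db_name, records in databases:
        for rec in records:
            key = (rec["gene"].upper(), rec["hgvs_c"].replace(" ", ""))
            index[key].append((db_name, rec["id"]))
    return dict(index)


# Invented placeholder records, for illustration only.
db_a = ("registry_A", [
    {"id": "A1", "gene": "COL7A1", "hgvs_c": "c.100A>G"},
    {"id": "A2", "gene": "COL7A1", "hgvs_c": "c.200C>T"},
])
db_b = ("registry_B", [
    {"id": "B7", "gene": "col7a1", "hgvs_c": "c.100A>G"},
])

xref = cross_reference(db_a, db_b)
# Variants reported by more than one database.
shared = {k: v for k, v in xref.items() if len({name for name, _ in v}) > 1}

if __name__ == "__main__":
    print(shared)
```

An index of this kind lets two databases link to each other's entries without a full merge. A real merge would also need curated reconciliation of nomenclature, transcripts, and duplicate patients, which string matching alone cannot provide.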