Gene variant databases, or locus-specific databases (LSDBs), are repositories that contain variation information for genes and proteins that have disease relevance. There are already close to 2,000 LSDBs [Mitropoulou et al., 2010], with more than one database for some genes and diseases. More recently, a database has been created for every gene associated with a Mendelian disorder by the authors of the LOVD LSDB system [Fokkema et al., 2011] with support from the GEN2PHEN project (http://www.gen2phen.org/). The eventual goal is to create a database for every gene in the genome. Variation information is also available from other resources; however, LSDBs are usually the primary and most trusted source of variation information, as they are curated and maintained by experts in the gene and disease.
The variation database field does not have established International Organization for Standardization (ISO) certified standards. Instead, there are some de facto standards, such as EMBL/GenBank/DDBJ reference sequences and numerous established best practices, standard procedures, guidelines, recommendations, and ontologies. All these help to present and provide the information in a consistent format readable to humans and suitable for computational analyses. The Human Variome Project (HVP, http://www.humanvariomeproject.org/), the Human Genome Variation Society (HGVS, http://www.hgvs.org/), and the GEN2PHEN project are working toward standardized variation and pathogenicity data presentation. One of the major goals of GEN2PHEN has been to establish standards to describe and unify genotype and phenotype information and, crucially, for data exchange. HVP is working toward publicizing and promoting global standards and guidelines for the variation field.
A number of guidelines and recommendations have been published over the years, the most recent by Kohonen-Corish et al. [2010]. Recommendations have been made, for example, for LSDB content [Claustres et al., 2002; Horaitis and Cotton, 2005; Scriver et al., 1999, 2000], ethics [Cotton et al., 2005; Povey et al., 2010], data collection [Cotton et al., 2007, 2009], somatic variations [Olivier et al., 2009], interpretation and reporting of variants [Plon et al., 2008; Richards et al., 2008], curation [Cotton et al., 2008; Celli et al., 2011], data sharing [den Dunnen et al., 2009], and nomenclature [den Dunnen and Antonarakis, 2000; Taschner and den Dunnen, 2011]. These instructions have been useful; however, some of them are already partly outdated and others are scattered throughout a number of publications. For a new database curator, it would be almost impossible to grasp all the concepts and their updates. Here, our goal is to provide guidance and discuss best practices and standards for each stage related to LSDB creation, curation, maintenance, and distribution and work toward global acceptance.
In this article, we cover guidelines for establishing LSDBs; experiences of LSDB curation are described in the accompanying publication (Celli et al., 2011).
Establishment and Maintenance of LSDBs
New LSDBs should be created only if there is not yet a database for a particular gene/disease or the existing database is outdated and obsolete. There is no reason to duplicate efforts. Existing databases are listed on the HGVS Website at http://www.hgvs.org/dblist/dblist.html, and the GEN2PHEN listing at http://www.gen2phen.org/data/lsdbs. The general link “GeneSymbol.lovd.nl” will automatically direct to any LSDB or list choices when there is more than one database for a given gene. In addition, the WAVe (Web Analysis of the Variome) [Lopes et al., 2011] service at http://bioinformatics.ua.pt/WAVe integrates variation information from numerous sources and can be used to identify databases as well as variations, as does DRUMS (Disease-Related Unique Mutation Search Engine) (http://www.scbit.org/glif).
It is strongly recommended that the scientists and clinicians in a field should work together with the hope that they can form unified consortia [Smith and Vihinen, 1996]. It is important that LSDBs be created in a systematic way that allows integration with other resources and reuse of data. To avoid reinventing the wheel, new curators should take advantage of existing recommendations and guidelines as well as the computer tools, software, and systems available.
The process and steps of establishing and curating an LSDB are outlined in Figure 1. Briefly, the reference gene/transcript/protein sequences and the official name of the gene should first be identified. Then the data items to be included in the database need to be decided. The database model should comply with recommendations in the field and this can be best achieved by using one of the existing LSDB maintenance systems that implement the current recommendations. An additional benefit is that the curator then does not have to tackle the numerous practical details of database modeling and implementation and development of tools for distribution, analysis, and collection of data. These steps are already solved in the tools that are used for the majority of existing LSDBs.
Database curation comprises the collection and coding of data and, in carrying out these tasks, ethical concerns have to be addressed. Data should be collected from the literature and from the research and clinical community. Ideally, a team of experts is needed for the curation to share the workload and to provide complementary expertise. Data should be entered into the database in a systematic fashion with DNA, RNA, and protein level descriptions of the variation in accordance with HGVS nomenclature. Genotype, variation effects, and pathogenicity should be described in as detailed a fashion as possible, consistent with patient confidentiality, using ontologies whenever possible. Database goals and contents need to be documented. Databases should be distributed and open access is strongly supported to ensure that the exchange of the data with other databases is facilitated. This will be ever more important in the future for the seamless integration of genotype and phenotype data in knowledge bases.
These steps (Fig. 1) may seem difficult, but the technical details have been solved in existing database systems and we will provide instructions for handling the rest. In the following sections, each of these steps is discussed in detail.
Genes and Sequences
An LSDB can contain information about variations in one or more genes and diseases. Some LSDBs may be themed on particular diseases or on genes that are related evolutionarily. Either way, it is important to use systematic gene names and symbols that are implemented and approved by the HUGO Gene Nomenclature Committee [Wain et al., 2002] (HGNC http://www.genenames.org/). As legacy aliases for gene names and abbreviations may exist, only HGNC names provide an unequivocal way of naming genes.
Many of the current LSDBs are named by the affected gene with “base” added to the end, for example, ADAbase for ADA gene variations in adenosine deaminase deficiency.
Variations are reported in relation to reference sequences on three levels: DNA, RNA, and protein (when applicable). Thus, three exactly corresponding reference sequences must be defined. The recently developed Locus Reference Genomic (LRG) reference sequence standard [Dalgleish et al., 2010] (http://www.lrg-sequence.org/) is the recommended choice for reference sequences, as LRGs are stable and will not need to be updated when new builds of the human genome are released. LRG records comprise a genomic DNA sequence mapped onto the current genome build and all alternatively spliced mRNA sequences required to report sequence variation, along with their corresponding protein sequences. In addition, LRGs afford the facility to label exons according to legacy practices, even when this deviates from strict sequential numbering, and also allow numbering of amino acids using nonstandard schemes. To avoid confusion, it may be beneficial to record variants in terms of both legacy and systematic numbers with respect to affected amino acids and exons, even when particular numbering schemes are well established. LRGs are created and curated at the European Bioinformatics Institute (EBI) and the National Center for Biotechnology Information (NCBI) using input from curators of existing LSDBs. When an LRG is not yet available, a RefSeqGene (http://www.ncbi.nlm.nih.gov/projects/refseq/rsg) entry might be a good starting point for a request to generate an LRG, and can be used in the LSDB in the meantime. LRGs do not have versions, but it is essential when using non-LRG reference sequences that both the accession number and version be reported at all times.
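The rule that non-LRG reference sequences must always carry both accession and version can be checked mechanically at data entry. The following is a minimal sketch only: the two regular expressions are assumed, simplified identifier shapes (not an official grammar), and the function name and the accession used in the example are hypothetical.

```python
import re

# Assumed, simplified identifier shapes (illustrative only):
#   LRG records such as "LRG_1" carry no version;
#   other records are expected as "ACCESSION.VERSION", e.g. "NM_000000.1".
LRG_PATTERN = re.compile(r"^LRG_\d+$")
VERSIONED_PATTERN = re.compile(r"^[A-Z]{1,2}_?\d+\.\d+$")


def check_reference(identifier: str) -> str:
    """Classify a reference sequence identifier before it is stored."""
    if LRG_PATTERN.match(identifier):
        return "lrg"  # stable record, no version required
    if VERSIONED_PATTERN.match(identifier):
        return "versioned"  # accession plus version, acceptable
    return "unversioned-or-unknown"  # should be rejected or queried
```

A curator interface could refuse any entry for which `check_reference` returns `"unversioned-or-unknown"`, for example `check_reference("NM_000000")`, where the version suffix has been omitted.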
Database Content, Data Models, and Database Management Systems
A database is the easiest, most versatile, and the recommended way for the collection, maintenance, analysis, and distribution of variation information, particularly as it can be used to divide the workload between experts. Most LSDBs use the relational database model and should follow accepted data models to facilitate data exchange and integration with other databases. In addition, they may utilize software developed for the predictive analysis of variation information. Whatever software solution is adopted, the database organization should support the most common user scenarios.
Decide on the data items to be included, especially if you want to include clinical information. Minimal requirements for data items to be included in all LSDBs for all genes and diseases have been developed by the GEN2PHEN project, but are not yet finalized. In addition to the minimal data set, other data items can be included, though such clinical data will be disease specific and vary greatly between disorders. Minimum Information About Somatic Mutations (MIASM) [Olivier et al., 2009] defines what must be recorded for somatic variations (Box 1). It would be preferable to include some clinical features for patients; however, the details have to be decided based on availability of data and curatorial resources and be consistent with protecting patient confidentiality. The more data items there are, the more versatile the database can be for individual users. The drawback is that it can be very tedious and time-consuming to collect the information.
Box 1. Minimum Information About Somatic Mutations (MIASM) [Olivier et al., 2009]
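To illustrate what a relational organization means in practice, the sketch below builds a deliberately tiny in-memory schema with Python's standard `sqlite3` module. It is not the data model of LOVD, MUTbase, UMD, or GEN2PHEN; the table names, column names, reference identifier, and variant descriptions are all hypothetical and chosen only to show how genes, variants, and individual observations can be kept in separate, linked tables.

```python
import sqlite3

# Hypothetical three-table layout: one gene, many variants, many observations.
SCHEMA = """
CREATE TABLE gene (
    hgnc_symbol   TEXT PRIMARY KEY,
    reference_seq TEXT NOT NULL
);
CREATE TABLE variant (
    variant_id  INTEGER PRIMARY KEY,
    hgnc_symbol TEXT NOT NULL REFERENCES gene(hgnc_symbol),
    dna_change  TEXT NOT NULL,
    UNIQUE (hgnc_symbol, dna_change)
);
CREATE TABLE observation (
    observation_id INTEGER PRIMARY KEY,
    variant_id     INTEGER NOT NULL REFERENCES variant(variant_id),
    source         TEXT NOT NULL
);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create the schema and load a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO gene VALUES (?, ?)", ("ADA", "EXAMPLE_REF.1"))
    cur = conn.execute(
        "INSERT INTO variant (hgnc_symbol, dna_change) VALUES (?, ?)",
        ("ADA", "c.100A>G"),
    )
    first_variant = cur.lastrowid
    conn.execute(
        "INSERT INTO variant (hgnc_symbol, dna_change) VALUES (?, ?)",
        ("ADA", "c.200C>T"),
    )
    conn.executemany(
        "INSERT INTO observation (variant_id, source) VALUES (?, ?)",
        [(first_variant, "literature"), (first_variant, "diagnostic laboratory")],
    )
    return conn


def observation_counts(conn: sqlite3.Connection) -> dict:
    """Return the number of recorded observations for each variant."""
    rows = conn.execute(
        "SELECT v.dna_change, COUNT(o.observation_id) "
        "FROM variant v LEFT JOIN observation o ON o.variant_id = v.variant_id "
        "GROUP BY v.variant_id ORDER BY v.dna_change"
    ).fetchall()
    return dict(rows)
```

Keeping observations in their own table means that the same variant reported by several sources is stored once and counted many times, which is one of the common user scenarios a database organization should support.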
Nature and number of tumor samples analyzed
Geographic location or name of hospital
Quality control procedure
Assessment of somatic origin
Sample: topography, morphology, nature, source
To facilitate connection to other services, LSDB information must be stored in a systematic way. HVP recommends that databases be built according to the GEN2PHEN Variation Object Model (Vario-OM) and Variation Markup Language (VarioML) (http://www.varioml.org). Currently, three systems are in the process of amending their data models: LOVD (http://www.lovd.nl/) [Fokkema et al., 2011], MUTbase (http://bioinf.uta.fi/MUTbase/) [Riikonen and Vihinen, 1999], and UMD (http://www.umd.be/) [Beroud et al., 2000]. LOVD is the most widely used system, followed by MUTbase, which has been used almost exclusively for primary immunodeficiencies. The UMD-based registries contain an abundance of clinical information, making them more like knowledge bases. These database management systems include all the necessary tools for submission, maintenance, and distribution of variation information. LSDBs should have links to other services, such as PubMed, dbSNP, OMIM, and PDB, in addition to sequence databases. More recently, the MutaDATABASE system, which will provide LSDBs for all human genes, has been described [Bale et al., 2011]. It is not yet clear what data model is being used and how the features compare with longer established LSDB systems. What does differentiate MutaDATABASE is that it is a single centralized database, unlike the others described here, each of which can be installed locally and customized according to individual curators' needs.
The database must be implemented on a secure server, with curators being the only persons permitted to modify the contents. The LSDB software, the operating system, the back-end database, and the Web server should not contain known security vulnerabilities. It is important that all components of the service be patched regularly to ensure the long-term integrity of the server and the information stored in it.
There are a number of ethical issues related to LSDBs [Povey et al., 2010] (Box 2). Databases should take into account specific communities and cultures as well as vulnerable persons. Submitters need consent from patients/parents/carers prior to submitting information to registries and the wording of the consent has been discussed in the article above. All patient data in LSDBs should be anonymized before submission and release, and LSDBs should have an ethics review board to handle all ethics-related topics relevant to the database.
For more details about anonymization see Celli et al. (2011). The recommendation to limit links to other databases is intended to protect individuals whose data exist in several LSDBs. The relevance of the limitation should be evaluated case by case. The caution regarding the transfer of data to genome browsers is intended to ensure that LSDB curators receive recognition for their data.
Box 2. Major Points to Consider Pertaining to Ethical Issues [Kohonen-Corish et al., 2010; Povey et al., 2010]
Clarify the main purpose of the particular database.
Define database policy with respect to sources of data.
Take specific communities/cultures into account.
Take vulnerable persons into account.
Create an ethics oversight committee.
Remove identifying information before submission to database.
Add further protection of confidentiality if needed.
Allow no further disclosure without consent.
Make provision for removal of data from the database.
Be cautious in response to requests to an LSDB curator for a private opinion.
Limit links to other LSDBs.
Consider carefully the transfer of publicly available data from LSDBs to genome browsers.
Guidelines for LSDB curation are provided in the accompanying article (Celli et al., 2011).
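The Box 2 point on removing identifying information before submission can be supported by a simple filter applied to every record before it leaves the submitter. This is a minimal sketch under stated assumptions: the field names are hypothetical, and dropping direct identifiers is only a first step that does not by itself guarantee anonymity (rare phenotypes or small populations may need further protection, as noted in Box 2).

```python
# Hypothetical names of fields that directly identify a person.
IDENTIFYING_FIELDS = {"name", "date_of_birth", "address", "hospital_number"}


def strip_identifiers(record: dict) -> dict:
    """Return a copy of the record without directly identifying fields."""
    return {
        key: value
        for key, value in record.items()
        if key not in IDENTIFYING_FIELDS
    }
```

For example, a record containing `name`, `hospital_number`, and `dna_change` would be reduced to its `dna_change` entry; the original record is left unmodified so that the submitting clinic retains its own copy.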
Genotype, Phenotype, and Pathogenicity
Many LSDBs were initially just collections of variations in particular disease-related genes, but subsequently many of them have included additional data items especially with respect to the clinical features and phenotype. In the future, the amount of clinical data is expected to increase as it will be an essential requirement for investigating genotype–phenotype correlations and for performing prospective and retrospective studies.
For the recording of variants, HGVS nomenclature (http://www.hgvs.org/mutnomen/) [Taschner and den Dunnen, 2011] should be followed. Variants should be described, where applicable, at the DNA, RNA, and protein levels. In most instances, the effects at the RNA and protein level are predictions, and should be indicated as such. HGVS nomenclature is widely used in the majority of LSDBs, but it cannot be used for explaining all possible variations. New examples of variation that might be useful for developing the nomenclature can be suggested by contacting MutNomen@JohanDenDunnen.nl.
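One way to keep the three description levels together, and to make explicit which of them are predictions, is to store them in a single record with a status flag per level. The sketch below is an assumption about how this could be organized, not the structure used by any existing LSDB system; the class name, field names, and the example variant are hypothetical, and the status is shown as a plain label rather than in HGVS notation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VariantDescription:
    """DNA-level description plus optional RNA and protein consequences."""

    dna: str
    rna: Optional[str] = None
    protein: Optional[str] = None
    # Consequences are treated as predictions unless confirmed experimentally.
    rna_predicted: bool = True
    protein_predicted: bool = True

    def describe(self) -> List[str]:
        """List the available levels, labelling predicted consequences."""
        lines = [f"DNA: {self.dna}"]
        levels = (
            ("RNA", self.rna, self.rna_predicted),
            ("protein", self.protein, self.protein_predicted),
        )
        for label, value, predicted in levels:
            if value is None:
                continue  # level not applicable or not recorded
            status = "predicted" if predicted else "experimentally confirmed"
            lines.append(f"{label}: {value} [{status}]")
        return lines
```

With a made-up entry such as `VariantDescription(dna="c.100A>G", protein="p.Thr34Ala")`, `describe()` returns `["DNA: c.100A>G", "protein: p.Thr34Ala [predicted]"]`, so a user can see at a glance that the protein consequence was deduced rather than observed.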
The amount of clinical information that is presented should be balanced between aspiration and available resources. The more data that are included the better; however, data items that are rarely recorded should be avoided. Most LSDBs that contain clinical information have taken a pragmatic approach and contain the most essential and typical features related to the condition. One of the most valuable use cases for LSDBs is to find whether a specific variant has been shown to have functional consequences: is it disease causing, or not? Pathogenicity of the variants should therefore be clearly indicated. Several LSDBs contain information for variants not thought to be disease causing as well as demonstrated disease-causing variants.
Curators should address the question of when a variant should be considered to be disease causing. This is not always self-evident even in monogenic diseases and much more complicated in complex diseases, for example, cancers. For certain disorders, guidelines for pathogenicity assessment have been published [Richards et al., 2008; Tavtigian et al., 2008]. The criteria used to determine pathogenicity should be made available on the database Website. The most recent recommendations have been discussed in the report of the 3rd HVP Meeting [Kohonen-Corish et al., 2010].
Use of Ontologies
Ontologies are controlled vocabularies in which terms have defined relations. Their systematic structure allows easy data retrieval and analysis as well as unequivocal and systematic description of different features and properties pertinent to the variants and the individuals carrying them. In this respect, the Human Phenotype Ontology (HPO) (http://www.human-phenotype-ontology.org/) has been developed for describing phenotypic abnormalities. Other ontologies are available for malformations [Allanson et al., 2009].
For describing the effects of variation on DNA, RNA, and/or protein sequence, structure, function, interaction, and other features, Variation Ontology (VariO) (http://variationontology.org/) can be used. The Phenotype and Genotype Experiment Object Model (PaGE-OM) [Brookes et al., 2009] (http://www.pageom.org/Home.html) has been developed and is currently being developed further by the GEN2PHEN project as Phenotype Object Model (Pheno-OM).
The first ontologies have now started to be used in LSDBs. It is recommended that newly established LSDBs use them from the beginning, as retrospective annotation will be more laborious.
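The practical benefit of defined relations between terms is that a search for a general term also retrieves records annotated with more specific ones. The sketch below illustrates this with a toy hierarchy; the term identifiers (`EX:0001` and so on) are invented placeholders rather than HPO or VariO terms, and each term is assumed to have a single parent, whereas real ontologies allow several.

```python
from typing import Dict, List

# Toy "is_a" hierarchy with invented identifiers: child term -> parent term.
PARENT: Dict[str, str] = {
    "EX:0002": "EX:0001",
    "EX:0003": "EX:0002",
    "EX:0004": "EX:0001",
}


def ancestors(term: str) -> List[str]:
    """Return all more general terms above the given term."""
    chain = []
    while term in PARENT:
        term = PARENT[term]
        chain.append(term)
    return chain


def find_records(annotations: Dict[str, List[str]], query: str) -> List[str]:
    """Return records annotated with the query term or any of its descendants."""
    hits = []
    for record_id, terms in annotations.items():
        if any(t == query or query in ancestors(t) for t in terms):
            hits.append(record_id)
    return sorted(hits)
```

If one record is annotated with `EX:0003` and another with `EX:0004`, a query for `EX:0002` returns only the first, while a query for the root term `EX:0001` returns both. Free-text phenotype fields cannot support this kind of retrieval.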
Distribution and Sharing of Data
Preferably, LSDBs should be Web-based and made publicly available with the strong recommendation that users should not have to register or pay for access to the data. As there are many uses for LSDBs, some of which are programmatic, the requirement for registration or payment of a fee could form a barrier to full use and access to the data.
Databases should have user-friendly Web pages with search capabilities so that users can easily find what they are looking for. In addition, the LSDB Website should include documentation of the service as well as a user guide or help pages.
Database contents can, and should, be shared with other services and users. HGVS members have published guidelines for such information sharing [den Dunnen et al., 2009] and the minimal information that LSDBs should share is listed in Box 3. By sharing data in this way, database curators receive credit for their hard work and core data remain freely available. This contrasts with the longer list of information needed for variant data submission to central databases shown in Box 4 [Kohonen-Corish et al., 2010]. Appropriate application of microattribution (Anonymous, 2008) provides the mechanism whereby data producers receive credit for their data curation and sharing efforts, even in the absence of traditional published articles.
Box 3. Minimal LSDB Information that Should be Shared [Kohonen-Corish et al., 2010]
Web address of the LSDB.
Contact details of the database curator(s).
HGNC gene name, reference sequences used (accession number and version if not an LRG).
Description of published sequence variants at the DNA level using HGVS recommendations and, when available, the description as in the original report, dbSNP ID, and/or MIM number of variant.
LSDB-specific identifier(s) to link directly to the specific variant(s) in the LSDB.
Box 4. Information Needed to Accompany Variant Submission to Central Databases [Kohonen-Corish et al., 2010]
Mandatory minimum information:
HGVS name or equivalent database accession for variant including alleles
Reference sequence used (accession and version if not an LRG)
Submitter identification (ORCID ID or equivalent)
Number of observations (or defaults to 1)
Optional recommended fields:
LSDB identifier, diagnostic laboratory name or submitter name (preferably from a future LSDB Registry)
URL for attribution
DOI of publication (or PubMed uid)
Individual or sample ID
Individual sample ethnicity
Individual gender or sample gender frequency
Individual genotype or sample genotype frequency
Individual or sample phenotype as “ontology name: ontology value” e.g. MIM number
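The mandatory minimum information of Box 4 lends itself to an automatic check before a record is sent to a central database. The following sketch assumes a submission is held as a simple dictionary; the key names are hypothetical and do not correspond to VarioML or to any central database's actual submission format.

```python
# Hypothetical keys for the mandatory items listed in Box 4.
MANDATORY_FIELDS = ("variant_name", "reference_sequence", "submitter_id")


def prepare_submission(record: dict) -> dict:
    """Check mandatory fields and apply the default observation count."""
    missing = [field for field in MANDATORY_FIELDS if not record.get(field)]
    if missing:
        raise ValueError("missing mandatory fields: " + ", ".join(missing))
    prepared = dict(record)  # leave the caller's record untouched
    # Box 4: the number of observations defaults to 1 when not given.
    prepared.setdefault("observations", 1)
    return prepared
```

A record lacking, say, the submitter identification is rejected with an explicit message, whereas a complete record without an observation count is passed on with `observations` set to 1. Optional recommended fields are carried through unchanged.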
A database, its contents, organization, purpose, and data items have to be properly documented on the Website. This is important as LSDB users need to know what is available and what is not and what to expect from the database. Database policy should state the purpose and practices of the LSDB, including how data are coded and handled. An example of a database policy statement is presented in Figure 2. To protect material rights, a copyright statement, usually for the department, university, or other entity, should be displayed, and a disclaimer is important for limiting or excluding liability. Examples of liability and copyright statements are presented in Figure 3. We advise that a lawyer or solicitor should be consulted in the relevant country about the wording of these statements. Database curators typically do not take any responsibility for the use of the data; instead they aim to provide as reliable information as possible. Instructions about how to cite the database are also useful. In addition to the documentation on the database Website, the database curators should inform HGVS (http://www.centralmutations.org/LsdbAdd.php) and GEN2PHEN (http://www.gen2phen.org/data/lsdbs) of the database.
As indicated above, there are a number of recommendations and standards for LSDB curators that should be followed. A summary of the most important recommendations is in Box 5. By following these principles, high-quality resources can be generated, and by using one of the existing database management systems a number of these issues are already taken care of. Still there are some areas where further development is needed and therefore the curators must follow developments in the LSDB field, for which HGVS and HVP provide excellent information dissemination support. HVP is working to release standards based on the guidelines presented in this article. The most probable future development in the field will be the creation of patient-centric genetic variant database systems which are better suited to the task of recording data from a more clinical point of view than is possible with current gene-centric LSDBs.
Box 5. Summary of Essential Guidelines to Follow when Establishing LSDBs
Start new LSDB only when needed
Use existing standards
  HGNC gene names
  HGVS variation nomenclature
  LRG sequences, if not available RefSeq and UniProt
  GEN2PHEN database model
  ontologies (HPO, PaGE-OM, VariO)
Select existing database management system
Nominate ethics committee and take care of ethical issues
  inform submitters of the need for consent
Describe the variant, its effect(s) and phenotype in a systematic way
Document the purpose and policy of the database
Ensure data security
Inform HGVS and GEN2PHEN of new databases
Distribute the database freely on the Internet
Be prepared for long commitment and continued updating
LSDB-related standards are for facilitating systematic and detailed recording of cases; on the other hand, they are also necessary for interoperability and the sharing of data. In the future, more bioinformatics tools will be needed to handle data, perform analyses, and make predictions based on information in LSDBs. The use of a standard format for data storage will make it possible to use the same tools for all or as many LSDBs as possible. Ontologies will be widely used. Examples in the LSDB field are HPO and VariO. The benefit of ontologies is that computational searches and analyses will be more powerful as each observable fact is always described with a systematic and appropriate term. Easy to use tools are needed for fast and reliable annotation and use of ontologies. Additionally, it is likely that digital identifiers will be used for entire LSDBs and individual items of content as well as the data submitters. However, these systems are still under development and there is not yet consensus on which system to use.
For LSDBs, data collection remains the most problematic element, especially in accessing the vast amount of unpublished data generated daily in routine diagnostics laboratories worldwide. Automated flow of information from diagnostic laboratories needs to be established to facilitate the exchange of data with LSDBs with minimal effort. Cafe Variome (http://www.cafevariome.org/) is a tool that has been developed to address this need and it is already available to use. Another system for collecting information from clinical laboratories is MutaDATABASE [Bale et al., 2011], which, however, does not at present follow the recommendations discussed here.
Genomic sequencing projects, such as The 1000 Genomes Project (http://www.1000genomes.org/), produce huge variation datasets, which are stored in dedicated services and often in dbSNP, a database for genetic variations (http://www.ncbi.nlm.nih.gov/projects/SNP/). Where possible, LSDB variant entries should be cross-referenced to dbSNP variants using rs numbers, though the majority of dbSNP variants do not have disease relevance. Additionally, exome sequencing data for patients with particular diseases will soon become available in larger quantities and will need to be stored in LSDBs.
To be most useful, LSDBs should be as complete as possible. Ideally, all variations in all genes affecting human disease should be collected, and this is the goal of the HVP. Complete collection is easier where there is a smaller number of patients and variants, as is the case in very rare disorders. For more common disorders, one strategy is being implemented by Human Mutation. The authors of “Mutation Updates” (e.g., see [Boria et al., 2010]) are encouraged to contact labs worldwide that are working on the gene in question to request that they submit mutations in return for authorship. Nature Genetics also has this policy under the banner of microattribution [Giardine et al., 2011]. This provides a publication incentive to submit data to the public domain for databasing.
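Cross-references to dbSNP are only useful if the rs numbers are stored in one consistent form. As a small sketch, the function below normalizes a few common ways of writing an rs number; the accepted input variants and the function name are assumptions made for illustration, and the check is purely syntactic (it does not confirm that the identifier exists in dbSNP).

```python
import re
from typing import Optional

# An rs number is written here as "rs" followed by a positive integer.
RS_PATTERN = re.compile(r"^rs[1-9]\d*$")


def normalize_rs(value: str) -> Optional[str]:
    """Return the rs number in canonical lowercase form, or None if malformed."""
    candidate = value.strip().lower()
    if candidate.isdigit():
        candidate = "rs" + candidate  # bare number supplied without prefix
    return candidate if RS_PATTERN.match(candidate) else None
```

Entries for which `normalize_rs` returns `None`, such as an HGVS description pasted into the wrong field, can be flagged for the curator rather than stored as a broken link.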
Description of pathogenicity related to genetic variation remains one of the major focal areas for LSDBs. There are still a number of open questions including, in many cases, the systematic definition of pathogenicity and use of ontologies. HPO is already available, but more detailed descriptions would be needed in many cases. An additional aspect of pathogenicity is the determination of whether or not a particular variant is disease related, and to what degree. Currently, detailed guidelines are only available for a few diseases with respect to assessing and describing the pathogenicity.
To guarantee the quality of LSDBs, some sort of rating system needs to be implemented as suggested by HVP [Kohonen-Corish et al., 2010]. This would allow users to easily assess how reliable the database is considered to be, how well it follows the recommendations and standards in the field, and how well it covers the current knowledge in the field. Once such ratings have been established, LSDB curators will be informed of the evaluation criteria and ratings can be reviewed every few years.
LSDB maintenance is an on-going task and therefore those involved have to be prepared to make a long-term commitment to provide updates at frequent intervals as new data become available. This task will increase and change in the future as exome and full-genome projects generate increasing amounts of variation data.
In conclusion, several standards and guidelines have been developed and, by following them, curators can generate high-quality information sources that are also compatible with other services and software. New guidelines will need to be developed in the near future as new data, for example, from next generation sequencing facilities become available in large quantities. However, the principles described here are intended to remain valid for the foreseeable future. Curators need to follow the field and implement new features as guidelines are released, but LSDB curation is a rewarding and important activity that serves the clinical and research community in numerous ways.
The authors want to thank the Sigrid Juselius Foundation, Biocenter Finland, and the Competitive Research Funding of the Tampere University Hospital for support.