Locus-specific database domain and data content analysis: evolution and content maturation toward clinical use

Authors

  • Christina Mitropoulou,

    1. Erasmus MC, Faculty of Medicine and Health Sciences, MGC-Department of Cell Biology and Genetics, Rotterdam, The Netherlands
  • Adam J. Webb,

    1. Department of Genetics, University of Leicester, Leicester, United Kingdom
  • Konstantinos Mitropoulos,

    1. Erasmus MC, Faculty of Medicine and Health Sciences, MGC-Department of Cell Biology and Genetics, Rotterdam, The Netherlands
  • Anthony J. Brookes,

    1. Department of Genetics, University of Leicester, Leicester, United Kingdom
  • George P. Patrinos

    Corresponding author
    1. Erasmus MC, Faculty of Medicine and Health Sciences, MGC-Department of Cell Biology and Genetics, Rotterdam, The Netherlands
    2. University of Patras, School of Health Sciences, Department of Pharmacy, Patras, Greece
    • University of Patras, School of Health Sciences, Department of Pharmacy, University Campus, Rion, GR-26504, Patras, Greece

Communicated by Mauno Vihinen

Abstract

Genetic variation databases have become indispensable in many areas of health care. In addition, more and more experts are depositing published and unpublished disease-causing variants of particular genes into locus-specific databases (LSDBs). Some of these databases contain such extensive information that they have become known as knowledge bases. Here, we analyzed 1,188 LSDBs and their content for the presence or absence of 44 content criteria related to database features (general presentation, locus-specific information, database structure) and data content (data collection, summary table of variants, database querying). Our analyses revealed that several elements have helped to advance the field and reduce data heterogeneity, such as the development of specialized database management systems and the creation of data querying tools. We also identified a number of deficiencies, namely, the lack of detailed disease and phenotypic descriptions for each genetic variant and links to relevant patient organizations, which, if addressed, would allow LSDBs to better serve the clinical genetics community. We propose a structure, based on LSDBs and closely related repositories (namely, clinical genetics databases), which would contribute to a federated genetic variation browser and also allow the maintenance of variation data. Hum Mutat 31:1–8, 2010. © 2010 Wiley-Liss, Inc.

Introduction

Recent years have witnessed a remarkable increase in the identification of underlying genetic defects leading to both common and rare inherited disorders. This increase may be attributed to the rapid development and improving economies of high-throughput genomic variation detection and sequencing technologies, supported by growing numbers and complexities of gene variation databases. Such databases are repositories in which allelic variations are catalogued and described within specific genes. Existing databases can generally be assigned to one of three broad categories: general (core or centralized) databases; locus-specific databases (LSDBs), recording variation data for specific genes; and National/Ethnic mutation databases (NEMDBs) [Patrinos and Brookes, 2005].

LSDBs benefit from rigorous expert curation, often coordinated by consortia of collaborating researchers with scientific expertise in particular genes, as in the HbVar database of hemoglobin variants and thalassemia mutations [Hardison et al., 2002; see also Claustres et al., 2002]. These resources contain up to 50% unpublished variations, usually with thorough phenotypic descriptions. Consequently, LSDBs are extremely useful tools, contributing toward the identification of causative mutations, providing information about phenotypic patterns associated with a specific mutation, and enabling researchers to define an optimal strategy for mutation detection [Patrinos and Brookes, 2005]. The latter reason, in particular, explains the recent rapid growth in the number of LSDBs, with such databases now available for hundreds to thousands of human genes (sometimes with more than one database per gene). These repositories aim to promote data submission and form well-maintained, accurate, and up-to-date data sources.

Currently, however, the LSDB field suffers from a number of serious deficiencies. First of all, the majority of LSDBs presently available have been constructed from a plethora of different technologies and designs. Most of the resulting databases are both unsophisticated in implementation and small in scale. Several database management systems (DBMSs), such as the Leiden Open Variation Database (LOVD: http://www.lovd.nl) [Fokkema et al., 2005], the Universal Mutation Database (UMD: http://www.umd.be) [Béroud et al., 2005], and MUTbase [Riikonen and Vihinen, 1999] attempt to harmonize the current data content heterogeneity by providing off-the-shelf software for LSDB development and curation. These three systems, however, have very different designs. The LSDB field is therefore attempting to grow and mature in response to the urgent needs that exist, while struggling to overcome deficiencies such as data content heterogeneity.

Less than a decade ago, Claustres and coworkers [2002] examined the comparatively few LSDBs that existed on the Web at that time in an attempt to define optimal content scope and suggest ways for data curation. However, the rapid growth of LSDBs and the vast heterogeneity that presently characterizes the field dictate the need for a thorough data-content and LSDB-domain analysis. This will allow comprehensive mapping of standards (i.e., data models, ontology options, on which the existing LSDBs are based) and provide insight into ways the field should further develop.

Here, we report our findings from a thorough domain analysis of 1,188 existing LSDBs currently available through the Internet. These data are compared with previous findings and, based on these results, recommendations are made toward implementation of LSDBs for use in a clinical and genetic laboratory setting.

Methods

Between September 2008 and September 2009 we examined 1,728 LSDBs based on a list amalgamated from the following sources: (1) Human Genome Variation Society Website (HGVS; http://www.hgvs.org/dblist/glsdb.html; 727 LSDBs), which contains an extensive and routinely updated list of LSDBs with their URLs; (2) the Leiden Open Variation Database Website (LOVD; http://www.lovd.nl/2.0/index_list.php; 853 LSDBs), which contains LSDBs developed using the LOVD database management system (DBMS) [Fokkema et al., 2005]; (3) the Universal Mutation Database website (UMD; http://www.umd.be; 26 LSDBs), which contains LSDBs developed using the UMD DBMS [Béroud et al., 2005]; and (4) the Website of the Bioinformatics group of the University of Tampere, Institute of Biomedical Technology (http://bioinf.uta.fi/base_root/mutation_databases_list.php; 122 LSDBs), which contains LSDBs developed using the MUTbase DBMS [Riikonen and Vihinen, 1999]. We found considerable overlap between these lists, with 540 LSDBs represented on more than one of the above Websites. After accounting for this redundancy, a total of 1,188 LSDBs remained for our downstream analysis.
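
The amalgamation step described above can be illustrated with a minimal sketch. It assumes that each source list is a collection of LSDB URLs and that two entries refer to the same LSDB when their normalized URLs match. The URLs and the `normalize` helper are hypothetical; the matching procedure actually used in the study is not described in the text.

```python
from urllib.parse import urlsplit


def normalize(url: str) -> str:
    """Hypothetical normalization: drop the scheme, lowercase the host,
    and strip a trailing slash from the path."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    query = f"?{parts.query}" if parts.query else ""
    return f"{parts.netloc.lower()}{path}{query}"


def merge_lists(sources):
    """Map each normalized LSDB URL to the set of source lists that mention it."""
    seen = {}
    for source_name, urls in sources.items():
        for url in urls:
            seen.setdefault(normalize(url), set()).add(source_name)
    return seen


# Made-up entries for illustration only; not taken from the real lists.
sources = {
    "HGVS": ["http://example.org/lsdb/GENE1/", "http://example.org/lsdb/GENE2"],
    "LOVD": ["http://EXAMPLE.org/lsdb/GENE1", "http://example.org/lsdb/GENE3"],
}

merged = merge_lists(sources)
total_listed = sum(len(urls) for urls in sources.values())  # entries before merging
unique = len(merged)                                         # LSDBs after merging
on_several = sum(1 for names in merged.values() if len(names) > 1)
print(total_listed, unique, on_several)
```

Keeping the set of source lists per LSDB, rather than only a deduplicated list, makes it possible to report both the final count and the number of LSDBs that appear on more than one list.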

These LSDBs were examined for the presence or absence of 44 content criteria pertaining to: (1) Database analysis, namely, (a) general presentation, (b) locus-specific information, and (c) database structure, and (2) Data content, namely, (a) data collection, (b) variant information table, and (c) database querying.

These criteria (summarized in Table 1) were selected to ensure an objective evaluation of the various LSDBs and their content. LSDBs were scored for these criteria using a binary scoring mode: “0,” absence/no; and “1,” presence/yes. Criteria based on subjective judgments by the evaluator (e.g., ease of use, LSDB design, etc.) were excluded from our study. We also excluded checking for “hit counters” from our study, as these do not provide a reliable estimate of Website traffic. Hit counters do not necessarily distinguish between unique and returning visitors, or between genuine visitors and automated visitors such as search engine robots. Implementation of such hit counters may also vary between LSDBs, making direct comparison inappropriate.
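
As an illustration of the binary scoring mode, the following sketch computes the percentage of LSDBs that satisfy each criterion. The LSDB names, the three criteria shown, and the scores are all made up for the example; only the 0/1 scoring scheme comes from the text.

```python
# Hypothetical subset of criteria; the study used 44.
CRITERIA = ["reference_sequence", "online_submission", "links_to_omim"]

# One row per LSDB, one 0/1 score per criterion (illustrative values only).
scores = {
    "LSDB_A": [1, 1, 0],
    "LSDB_B": [1, 0, 0],
    "LSDB_C": [1, 1, 1],
    "LSDB_D": [0, 1, 0],
}


def presence_percentages(scores, criteria):
    """Return, for each criterion, the percentage of LSDBs scored 1."""
    for name, row in scores.items():
        if len(row) != len(criteria) or any(v not in (0, 1) for v in row):
            raise ValueError(f"{name}: expected one 0/1 score per criterion")
    n = len(scores)
    column_totals = [sum(column) for column in zip(*scores.values())]
    return {c: 100.0 * total / n for c, total in zip(criteria, column_totals)}


percentages = presence_percentages(scores, CRITERIA)
for criterion, pct in percentages.items():
    print(f"{criterion}: {pct:.1f}%")
```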

Table 1. Database Contents and Data Content Criteria Upon Which the Entire LSDB Domain Analysis has been Performed

1. Database features                      2. Data content
1a. General presentation                  2a. Data collection
  Explanation of content and aim            Collection via literature
  Useful links                              Contact curators
  Links to OMIM                             Online submission
  Links to HGMD                             Year of the last LSDB update
  Links to other LSDBs
  Database description published          2b. Variation table
  Copyright                                 Complete reference list
  Disclaimer                                Summary phenotypic description
  Language other than English               Detailed phenotypic description
                                            Links to references
1b. Locus-specific information              Cross-reference with other databases
  Reference sequence                        Restriction enzyme change reported
  Information about disease                 Ethnic group
  List of associations                      Mutation frequency
  Chromosome location                       Detection method
  Information on gene
  Protein function                        2c. Database querying
  Nomenclature followed                     Querying tool(s)
                                            Field: variation name
1c. Database structure                      Field: Gene region
  Summary table listing all variations      Field: Codon number
  Downloadable variation table              Field: Author's name (a)
  Static HTML pages (no search option)      Field: Phenotype
  Flat-file database                        Field: Ethnic group
  Relational database                       Field: Geographic location
  Variation visualization tool              Other fields for querying
  Use of a DBMS

(a) Author from published report or personal communication.

We have made our extensive listing of LSDBs, complete with all criteria described in this article, available via the GEN2PHEN Knowledge Centre (http://www.gen2phen.org). The list can be browsed, filtered, and searched at http://www.gen2phen.org/data/lsdbs. We will continue to maintain and extend this listing, and encourage visitors to contribute by submitting additional LSDBs. The listing can also be accessed in plain text and Microsoft Excel formats. We also offer a number of machine-readable formats such as Atom, allowing gene-specific listings to be retrieved and utilized by third-party Websites.
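
To show how a third-party Website might consume such a machine-readable listing, here is a minimal sketch that parses an Atom feed with the Python standard library. The feed content below is invented for the example and does not reproduce the actual GEN2PHEN feed, whose exact structure is not given in the text; only the Atom namespace is standard.

```python
import xml.etree.ElementTree as ET

# Standard Atom namespace (RFC 4287).
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hypothetical gene-specific feed; titles and links are made up.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>LSDBs for GENE1 (hypothetical)</title>
  <entry>
    <title>Example GENE1 database</title>
    <link href="http://example.org/lsdb/GENE1"/>
  </entry>
  <entry>
    <title>Another GENE1 database</title>
    <link href="http://example.net/GENE1"/>
  </entry>
</feed>"""


def list_entries(feed_xml):
    """Return (title, link) pairs for every entry in an Atom feed."""
    root = ET.fromstring(feed_xml)
    entries = []
    for entry in root.findall("atom:entry", ATOM_NS):
        title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        link = entry.find("atom:link", ATOM_NS)
        href = link.get("href") if link is not None else None
        entries.append((title, href))
    return entries


entries = list_entries(SAMPLE_FEED)
for title, href in entries:
    print(title, href)
```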

Results

The 1,188 LSDBs that were included in our analysis covered 967 different genes, with 103 LSDBs being redundant for the genes they represented. Specifically, 93 genes were represented in two different LSDBs, of which 11 involved the same DBMS (e.g., LOVD) but different installations (e.g., LSDBs for both the ASS1 and GPM6B genes are available in both the Leiden and Penn State University installations). In these cases, the number of variant alleles was different. Also, for nine genes (AP3B1, BEST1, CDH23, FOXL2, HPS1, L1CAM, LMNA, LYST, and MYO7A), there were three different LSDBs, whereas for the TP53 gene there were four different LSDBs available on the Web.

In addition to the 1,188 LSDBs included in our analysis, we also identified 33 gene-specific variation resources that did not qualify as LSDBs (i.e., nondatabase resources such as simple downloadable PDF files). An additional 21 LSDBs unavailable to the general public, that is, either password-protected or located in members-only areas (e.g., http://www.euroglycanet.org), were also excluded from our analysis. A further nine databases were no longer available. By way of comparison, the Human Gene Mutation Database (HGMD; http://www.hgmd.cf.ac.uk) [Stenson et al., 2003] lists 2,689 genes in its public version (and 3,739 in its Professional 2010.1 Release) containing at least one variation.

Criteria Examined

The presence or absence of 44 content criteria was examined in the 1,188 LSDBs, pertaining to: (1) database analysis (general presentation, locus-specific information, and database structure), and (2) data content (data collection, variant information table, and database querying).

Database analysis

General presentation: The majority of LSDBs (86.5%) had a home page explaining the database contents and aim. A user guide and a minimal set of relevant external links were generally available for the user to access additional information (Fig. 1A). Important links included HGMD (55.6%), OMIM (Online Mendelian Inheritance in Man) [Amberger et al., 2009] (76.5%) and other useful links (91.2%; GenBank/EMBL, HGVS, etc.), but only 0.3% included links to other LSDBs. A copyright notice and a guide to citation were displayed in 81.1% of LSDBs, whereas 78.9% also displayed a disclaimer notice.

Figure 1.

Overview of the database features criteria, namely, LSDB general presentation (A), locus-specific information (B), and database structure (C).

In approximately 25% of the cases studied, the database description was published in a peer-reviewed scientific journal. Finally, only 0.5% of the LSDBs were presented in a language other than English, namely, French in all cases.

Locus-specific information: A number of databases provide supplementary information such as a reference sequence (87.3%), chromosomal location (74.5%) or other information (82.8%) on the gene of interest, making these registries readily accessible to physicians and scientists from many fields. Only 21.1% provided information on the relevant disease(s) and even fewer (2.4%) provided information on protein function. Proper variation nomenclature [den Dunnen and Antonarakis, 2001] was followed in 76.4% of LSDBs. Finally, only 19.4% of LSDBs displayed links to patient associations or to Websites with clinical content (Fig. 1B).

Database structure: Contrary to previous observations, few LSDBs (11.2%) were structured as flat-files containing a number of fields for each entry. The majority of LSDBs were relational (Structured Query Language [SQL]-based; 82.6%). However, a substantial number (195, 16.4%) are still based on static HTML pages (rather than pages dynamically generated on request using SQL database queries). This absolute number is similar to that in the previous study (191, 73%) [Claustres et al., 2002] despite the considerable number of new LSDBs included in this study. This suggests that these static HTML LSDBs represent older databases that have not been upgraded, whereas newer databases have opted for more sophisticated solutions. Furthermore, we found that over half (51%) of these static HTML databases have not been updated since 2002. A complete table listing all variations was available in 87.3% of LSDBs, of which only a small portion (10.5%) offered downloadable formats, and these were sometimes difficult to download.

Very few LSDBs showed mutation maps and few added graphical displays, including dynamic graphing tools, depicting the location of variations throughout the gene (or protein) sequence (13.5%). The SERPINA1 LSDB [Zaimidou et al., 2009], for example, employed the VariVis tool [Smith and Cotton, 2008], whereas MUTbase-based LSDBs employed a built-in visualization tool [Riikonen and Vihinen, 1999].

As there are no uniformly accepted object model standards for LSDB design yet, a substantial number of LSDBs are based on ad hoc custom-built platforms, resulting in data content heterogeneity. It is encouraging though that 66.6% of LSDBs are based on an established DBMS geared for locus-specific variation data, namely, LOVD (853 LSDBs), MUTbase (122 LSDBs), and UMD (26 LSDBs). General use of such platforms contributes significantly to data uniformity (Fig. 1C).

A large number of LSDBs are updated frequently. In particular, 691 (58%) of the LSDBs analyzed were updated in 2009, whereas 900 (76%) were updated during the last 2 years (Fig. 2). Fortunately, only 59 LSDBs (5.4%) were truly outdated, that is, updated in 2000 or earlier, whereas few others were updated between 2001 and 2007.

Figure 2.

Year of the last LSDB update.

Data content

Data collection and submission: Individual LSDB entries usually correspond to a variation in a single patient or data that are collectively reported from a group of patients bearing the same variation. This, generally manually curated, information is based on data derived from both published literature (99.9% of LSDBs) and direct submissions of unpublished variations by researchers to the LSDB curators (98.1% of LSDBs). Such contributions are made either by e-mail or in prestructured data submission forms. Such forms were generally made available upon request or could be downloaded from the corresponding LSDB. Direct online submission, using dedicated online submission forms in password-protected interfaces, was available in 75% of the cases (Fig. 3A).

Figure 3.

Overview of the data collection criteria, namely, data collection (A), variation table (B), and database querying (C).

Variation table: The majority of LSDBs provide information on the gene of interest, protein function, protein structure, and/or protein sequence alignment. A complete reference list was included in 87.7% of LSDBs, of which 94.8% linked directly to the PubMed literature database. Summary phenotypic descriptions (i.e., pathogenicity: Yes/No, etc.) were available in 74.5% of LSDBs, whereas detailed phenotypic descriptions were available only in 8% of the LSDBs analyzed. The depth of the detailed information depends on the nature of the LSDB and of the resulting phenotype (e.g., cancer, hematological disorders, syndromes, etc.). Only 1.5% of LSDBs linked to corresponding entries in other LSDBs. In addition to the variation listing, many databases provided data fields for associated information, such as variation detection methodology (57.6%), and restriction enzyme changes that are indicative of the presence or absence of a gene variation (65.2%). Although the ethnic or geographic origin of patients was indicated in 69.9% of LSDBs, variation/allele frequency data were only available in 1.6% of LSDBs. This likely reflects the fact that, for many genes, this information is just not known. These data are summarized in Figure 3B.

Database querying: A search engine that allows tailored database querying is one of the main features that distinguish an LSDB from the conventional search functionality of generic genome browsers. In 76.9% of LSDBs, mostly relational databases or those utilizing a DBMS, a search engine was available. These allowed the user to query for variation name (67.1%), gene region (63%), codon number (62.2%), author name (68.9%), phenotype (69.1%), ethnic group (59.8%), or geographic location (58.6%) (Fig. 3C). In 70.9% of LSDBs, additional fields for querying were available, such as cancer type and classification, DNA source, protein domain, etc.
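
The kind of field-based querying described above can be sketched with an in-memory SQLite database. The table layout, column names, and variant records are hypothetical and chosen only to mirror the query fields listed in the text; they do not reproduce the schema of LOVD, UMD, MUTbase, or any other DBMS.

```python
import sqlite3

# Hypothetical single-table layout mirroring the query fields named in the text.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE variant (
           name TEXT, gene_region TEXT, codon INTEGER, author TEXT,
           phenotype TEXT, ethnic_group TEXT, location TEXT)"""
)
# Made-up records for illustration only.
conn.executemany(
    "INSERT INTO variant VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("c.100A>G", "exon 2", 34, "Author A", "pathogenic", "Group X", "Country 1"),
        ("c.250del", "exon 3", 84, "Author B", "pathogenic", "Group Y", "Country 2"),
        ("c.310C>T", "exon 3", 104, "Author A", "benign", "Group X", "Country 1"),
    ],
)

ALLOWED_FIELDS = {"name", "gene_region", "codon", "author",
                  "phenotype", "ethnic_group", "location"}


def search(conn, **criteria):
    """Return variant names matching all given field=value criteria."""
    clauses, params = [], []
    for field, value in criteria.items():
        if field not in ALLOWED_FIELDS:  # whitelist column names
            raise ValueError(f"unknown query field: {field}")
        clauses.append(f"{field} = ?")   # values are bound as parameters
        params.append(value)
    where = " AND ".join(clauses) or "1"
    sql = f"SELECT name FROM variant WHERE {where} ORDER BY name"
    return [row[0] for row in conn.execute(sql, params)]


print(search(conn, gene_region="exon 3"))
print(search(conn, author="Author A", phenotype="pathogenic"))
```

Column names are checked against a whitelist and values are passed as bound parameters, so user-supplied search terms are never interpolated into the SQL text.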

Data quality: As far as data quality is concerned, we have attempted to get some indication on how sparse the data documented in the LSDBs are over some of the key content criteria. We have determined the number of LSDBs that combine information on three main parameters, namely, pathogenicity, reference sequence and adoption of the official HGVS nomenclature. This would be yet another quality indicator for LSDBs, because these parameters are among the most relevant ones for LSDB users. We found that 154 LSDBs (13%) fulfill all three content criteria.

On the other hand, we found some inconsistencies in the gene names used in a number of LSDBs. In particular, we found 18 LSDBs that were not using HUGO Gene Nomenclature Committee (HGNC) symbols, and we made a note of these LSDB listings on the GEN2PHEN Knowledge Centre. All these gene names have been converted to their proper HGNC names and hyperlinked to the corresponding LSDBs. The full list of the changed gene names, with aliases or withdrawn names in brackets, is as follows: BEST1 (VMD2), BRCA2 (FANCD1), BRIP1 (FANCJ), CDK16 (PCTK1), CLRN1 (USH3A), DCAF12L1 (WDR40B), DCAF12L2 (WDR40C), DCAF8L1 (WDR42B), ELANE (ELA2), GIGYF2 (PARK11), HMGN5 (NSBP1), KDM5C (JARID1C), KDM6A (UTX), NOS2 (NOS2A), PALB2 (FANCN), PEX2 (PXMP3), PNP (NP), TAB3 (MAP3K7IP3).
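
The symbol conversion listed above amounts to a simple lookup, sketched below. The mapping is taken directly from the list in the text; the `to_hgnc` helper is a hypothetical name, and it leaves any symbol not in the list unchanged rather than consulting HGNC.

```python
# Approved HGNC symbol -> alias or withdrawn symbol, as listed in the text.
APPROVED_TO_OLD = {
    "BEST1": "VMD2", "BRCA2": "FANCD1", "BRIP1": "FANCJ", "CDK16": "PCTK1",
    "CLRN1": "USH3A", "DCAF12L1": "WDR40B", "DCAF12L2": "WDR40C",
    "DCAF8L1": "WDR42B", "ELANE": "ELA2", "GIGYF2": "PARK11",
    "HMGN5": "NSBP1", "KDM5C": "JARID1C", "KDM6A": "UTX", "NOS2": "NOS2A",
    "PALB2": "FANCN", "PEX2": "PXMP3", "PNP": "NP", "TAB3": "MAP3K7IP3",
}

# Reverse lookup: alias or withdrawn symbol -> approved symbol.
OLD_TO_APPROVED = {old: new for new, old in APPROVED_TO_OLD.items()}


def to_hgnc(symbol):
    """Return the approved symbol for a listed alias; other symbols pass through."""
    cleaned = symbol.strip().upper()
    return OLD_TO_APPROVED.get(cleaned, cleaned)


print(to_hgnc("FANCD1"))  # alias from the list above
print(to_hgnc("TP53"))    # not in the list, returned unchanged
```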

Discussion

The elucidation of the human genome sequence has revolutionized our ability to explore how genes cause disease and other phenotypes. Sadly, the resulting flood of primary genomic information is not yet being managed or utilized as effectively as it should be, due, not least, to the lack of a sufficiently organized and mature database infrastructure. Two international consortia have emerged to assist in addressing these issues, namely, the Human Variome Project (HVP; http://www.humanvariomeproject.org) [Horaitis et al., 2007; Kaput et al., 2009] and the GEN2PHEN project (http://www.gen2phen.org), an integrated project funded by the European Commission.

Here, we have reported our results from a thorough domain analysis of the LSDB field, motivated by the goals of the GEN2PHEN project. This effort constitutes a formal “requirements analysis” that would: (1) contribute guidelines upon which the LSDB field can be further evolved, (2) formalize the data models and the nomenclature systems being utilized by the entire LSDB community, and (3) bring maximum synergy with groups involved in the LSDB field. Actual data models in current use were fully documented and compared with previous data from the time when the field was just starting to grow [Claustres et al., 2002]. Our overall goal was to provide important supporting material for defining the basis on which the LSDB field would not only evolve toward unifying genetic variation databases in one or more research-oriented central genome variation browsers, but also mature toward implementation in the clinical environment, such that it would be beneficial to medical practitioners, bioscientists and, ultimately, patients.

Evolution of the LSDB Domain: Strengths and Pitfalls

We documented, in detail, the content of 1,188 unique LSDBs in an effort to represent the type and depth of information that is currently provided in such databases, mostly aiming to assess the way the LSDB field is currently evolving. Our approach differs from the one adopted several years ago [Claustres et al., 2002] in the sense that not only has our study included significantly more LSDBs from a multitude of sources but we have also excluded criteria that are more subjective in nature.

A key observation from our analysis is that more and more LSDBs are generated using an LSDB management system, often downloadable, for the collection, curation, and storage of both published and unpublished data. In particular, 66.6% of the LSDBs analyzed are built using the LOVD, MUTbase, or UMD DBMS, versus only 27% in the previous study (Fig. 1C) [Claustres et al., 2002]. Therefore, today there is less data-content heterogeneity than existed 8 years ago. In addition, there is a trend toward interested database curators building their LSDBs by selecting from the existing DBMSs rather than designing their LSDB from scratch.

The use of DBMSs based on relational databases gives LSDBs extensive querying capacity. In our analysis, we found that 76.6% of LSDBs offer extensive querying capacity, compared with 35% documented previously (Fig. 3C). It is notable that in the previous study 73% of LSDBs consisted of static HTML pages; this proportion has now decreased to just 16.4% (Fig. 1C). The use of a DBMS also allows other scientists working in the same field or in diagnostic laboratories to submit a novel variant to the LSDB curators, either by contacting them directly (98.1% vs. 29%) or via a dedicated online submission tool (75% vs. 68%) (Fig. 3A). In the latter case, although the percentages are comparable, the considerable increase in the total number of LSDBs on which this study was performed (almost 5-fold more) shows that an online direct submission tool is incorporated in many more LSDBs, mostly those based on the LOVD DBMS. Also, the use of a DBMS has enabled database curators to keep their LSDBs up to date more easily. We found that 58% of existing LSDBs were updated in 2009, and 76% were updated in the last 2 years. On the other hand, only 13% of LSDBs were last updated in 2006 or earlier (Fig. 2).

Documentation of ethnic differences of the various mutant alleles and their corresponding variation frequencies in existing LSDBs shows a striking discrepancy. Although the number of LSDBs documenting the ethnic background of the mutant allele carrier or patient has improved (69.9%, up from 26%), documentation of variation frequencies has sharply decreased (1.6%, down from 18%) (Fig. 3B). This observation can be explained either by the increasing number of National/Ethnic mutation frequency databases (NEMDBs) [Patrinos, 2006; van Baal et al., 2007] or by the lack of relatively recent studies on this topic. Although these data are of clear value and help to stratify national variation screening efforts, journals today tend to discourage authors from publishing them. The existence of dedicated journals that encourage submission of this kind of data [Patrinos and Petricoin, 2009] would tackle this problem.

On the other hand, there are a number of shortcomings that have been observed in existing LSDBs. First of all, there have been a few national efforts to document mutant alleles observed in specific populations in DBMSs designed for LSDBs rather than NEMDBs, such as the Chinese and the Australian Human Variome Project nodes (http://china-hvp.org/LOVD; https://australianhumanvariomedatabase.arcs.org.au, respectively), both based on the LOVD platform. As a result, there will be several genes that will appear in more than one LOVD installation, such as the BRCA1/2 genes that appear in the Australian and Chinese Human Variome Project databases and Leiden's BRCA-specific LSDB (http://chromium.liacs.nl/LOVD2/cancer), all based on the LOVD platform. This constitutes a major problem, because, if such an approach is adopted by others, there will be no single LSDB to document all mutant alleles of a particular gene, which would confuse the end user as to which LSDB is the most comprehensive. In particular, in the case of the BRCA1/BRCA2 genes, although Leiden's BRCA-specific LSDB seems to be the most comprehensive (deduced from the number of unique variants reported therein), there are several variants that are not documented in this LSDB yet are reported in the Chinese Human Variome Project (e.g., c.66dupA, c.43A>G, etc.; Table 2). The creation of NEMDBs that only document the prevalence of mutant alleles in specific populations and ethnic groups, and their corresponding mutation frequencies, where applicable, could be a solution to this problem [Patrinos, 2006].
Documentation of the genetic basis of several populations has already started, that is, Hellenic [Patrinos et al., 2005], Cypriot, Iranian [Kleanthous et al., 2006], Israeli [Zlotogora et al., 2007], and such efforts have been encouraged by the existence of a dedicated DBMS for NEMDB development and curation (ETHNOS) [van Baal et al., 2010], and funded by the European Commission (FP6-INCO “MEDGENET” [031968], FP6-INFRA “ITHANET” [026539]). A similar shortcoming lies in the fact that there are several redundant LSDBs installed in different locations, such as the LSDBs for the ATR-X gene (the Mental Retardation database [http://grenada.lumc.nl/LOVD2/MR/home.php?select_db=ATRX] and Penn State's LOVD copy of the HbVar databases [http://lovd.bx.psu.edu/home.php?select_db=ATRX]) or for the FLCN gene (the Folliculin Mutation Database [http://skingenedatabase.com/home.php?select_db=FLCN] and the European BHD Consortium [EBC] database [https://grenada.lumc.nl/LOVD2/shared1/home.php?select_db=FLCN]).

Table 2. Comparison Among Different LOVD Installations Hosting Redundant LSDBs for the BRCA1 and BRCA2 Genes

Installation          LSDB    Total number of unique DNA variants reported    Total number of variants reported
Chinese HVP node      BRCA1   134                                             234
                      BRCA2   62                                              113
Australian HVP node   BRCA1   0                                               0
                      BRCA2   0                                               0
BRCA                  BRCA1   502                                             1,455
                      BRCA2   485                                             924

Footnote: LSDB not installed. URLs: Chinese HVP node: http://china-hvp.org/LOVD/; Australian HVP node: https://australianhumanvariomedatabase.arcs.org.au/; BRCA: http://chromium.liacs.nl/LOVD2/cancer/home.php.

Disappointingly, there are several LSDBs that, although installed, have no content. Of the total of 502 LSDBs comprising the LOVD installation of the Mental Retardation database (data derived from Tarpey et al. [2009]), as many as 20% (100 LSDBs) had no variants documented (websites accessed November 2009). In such cases, we strongly recommend that curators not make a new LSDB available until a critical mass of genomic variation data pertaining to that particular gene has been assembled. Another common problem in the LSDB field is password-protected LSDB access. There are more than 20 LSDBs that require users to have previously registered to access their content.

Our analysis revealed that a number of LSDBs in several installations do not qualify as LSDBs. Examples include the International Immunogenetics Information System (http://imgt.cines.fr), which provides no information on DNA variation; the Aldehyde Dehydrogenase Gene Superfamily databases (http://www.aldh.org), which provide only a graphical distribution of the genetic variation in the ALDH genes, without an actual querying interface; and the SNCA LSDB, which documents SNCA gene variation only in a Portable Document Format (PDF) file (http://www.med.upatras.gr/athanassiadou/snca_lsdb.pdf).

Last, but not least, the almost exclusive use of English in all LSDBs is another finding that was obvious from our analysis. In particular, there were only six LSDBs that documented their contents in a language other than English, namely, French. This number has been significantly reduced compared to the previous study [Claustres et al., 2002], namely, 0.5% versus 11% (Fig. 1A). This feature may pose difficulties for users with poor literacy in English, particularly patients, hence making LSDBs useful only to English speakers.

Maturation of LSDB Content Toward Clinical Use: The Clinical Genetics Database Concept

Globally, DNA diagnostics laboratories undertake extensive genetic screening in patients affected by a plethora of disease states, but currently little of this extremely valuable primary information is finding its way into the public domain for wider exploitation in anonymized format. Apart from the lack of suitable databases and support software, other problems relate to ethical and legal restrictions, financial structures, and limited manpower. Close partnering between LSDB stakeholders and clinical genetic diagnostics laboratories would therefore be highly desirable, to identify their needs and build some key bioinformatics solutions, not only to assist in data gathering and querying, but also to routinely move diagnostic laboratory gene variation data into the public domain and report on this accordingly. From our LSDB domain analysis it seems that current LSDBs are designed more for the research community than for the clinical/diagnostic laboratory. Of the few LSDBs (8%) where detailed phenotypic descriptions are available, even fewer combine scientific and diagnostic data on variations with associated information useful for clinicians or students (e.g., population distribution of alleles, haplotype associations), for patients and their families (e.g., treatment, diagnosis, dedicated organizations, or parent associations), and for diagnosticians (technical support in the form of primer sequences and variation detection protocols). The poor documentation on disease and protein function supports this claim.

Current LSDBs do not support data retrieval and transmission between scientific personnel and clinicians working in diagnostic laboratories [Cotton et al., 2009]. To satisfy this requirement, a novel type of registry, that is, a clinical genetics database (CGDB), could be developed for use in clinical diagnostic laboratories that would accommodate complete and reliable genotype and phenotype records on each patient for each novel submitted variation. Such registries would incorporate at least the following main features: a unique identifier for each instance of a genetic variant; detailed genotypic and phenotypic descriptions; and information regarding the person or group that has contributed the records. For the latter feature, a unique ID would define the data contributor, based on existing schemes (e.g., ResearcherID from Thomson Reuters; http://www.researcherid.com). Genotypic data entry should be automated directly from accredited genotypic platforms or a Laboratory Information Management System (LIMS). For reasons of consistency and data content homogeneity, CGDBs will need to operate under the same, or very similar, DBMSs as the LSDBs, sharing a large number of fields. Prefabricated LSDB and CGDB database software will ideally be available for simple download and installation (i.e., “LS/CGDB-in-a-box” solutions), with these being designed to operate in a manner that enables every LSDB and CGDB group to retain full control and ownership of their database content. Every resulting database can then be integrated with centralized search capabilities, and all search results will provide links back to the original source databases. Such a “federated” LS/CGDB system could also be structured as a virtual NEMDB for a particular population, subsequently contributing data to a central genetic variation genome browser (Fig. 4).
The latter browser would then consist of a collection of not only genetic variations but also frequencies ofvariations in different ethnic populations, which is of particular importance for the clinical genetic testing in the developing countries [Patrinos, 2006].
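As an illustration only, the three minimum features listed above for a CGDB record (a unique identifier per variant instance, genotypic and phenotypic descriptions, and contributor information) could be modelled as follows. This is a minimal sketch, not a schema proposed in this article: the class names, field names, and the `missing_fields` helper are hypothetical, and the contributor identifier is assumed to be an opaque string issued by a scheme such as ResearcherID.

```python
from dataclasses import dataclass, field
from uuid import uuid4


@dataclass(frozen=True)
class Contributor:
    # Assumption: an opaque ID from an external scheme (e.g. a ResearcherID-style value).
    contributor_id: str
    name: str


@dataclass
class VariantRecord:
    # One record per observed instance of a variant, not one per distinct variant.
    gene: str
    genotype_description: str
    phenotype_description: str
    contributor: Contributor
    source_database: str
    # Hypothetical choice: a random UUID serves as the per-instance identifier.
    record_id: str = field(default_factory=lambda: uuid4().hex)


def missing_fields(record):
    """Return the names of required descriptive fields that are empty."""
    required = ("gene", "genotype_description", "phenotype_description")
    return [name for name in required if not getattr(record, name).strip()]
```

Because the identifier is attached to the instance, two patients carrying the same variant yield two records with different `record_id` values, and a record submitted without a phenotypic description can be flagged before it is accepted.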

Figure 4.

A proposed structure for a “federated” LS/CGDB system. LSDBs and CGDBs are developed using an off-the-shelf DBMS for each database type that would share the majority of fields. The LSDBs and CGDBs would form a virtual NEMDB for each population that will subsequently contribute data to a central genetic variation browser (see also text for details).
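The federated arrangement in Figure 4 can be sketched in a few lines, under stated assumptions. Each source database keeps its own records, and a central search only collects hits together with a link back to the owning database. The class `SourceDatabase`, the function `federated_search`, the record layout, and the `/variants/<id>` URL pattern are all hypothetical and are not part of any existing LSDB software.

```python
class SourceDatabase:
    """A locally owned LSDB or CGDB; its records stay in the owner's store."""

    def __init__(self, name, base_url, records):
        self.name = name
        self.base_url = base_url
        # Assumed record layout: dicts with "id", "gene", and "variant" keys.
        self._records = records

    def search(self, gene):
        # Return summaries plus a link back to the source, not the full records.
        return [
            {
                "source": self.name,
                "gene": r["gene"],
                "variant": r["variant"],
                "link": f"{self.base_url}/variants/{r['id']}",
            }
            for r in self._records
            if r["gene"] == gene
        ]


def federated_search(sources, gene):
    """Query every source in turn and concatenate the hits in source order."""
    hits = []
    for source in sources:
        hits.extend(source.search(gene))
    return hits
```

In this design the central service holds no variant data of its own, which is one way of letting each LSDB and CGDB group retain control and ownership of its content while still being discoverable through a single query.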

Conclusions and Future Perspectives

Our detailed LSDB domain analysis revealed many elements that have recently contributed to moving the field forward and reducing overall data heterogeneity, such as the development of specialized DBMSs and the existence of data querying tools. Our analysis also pinpointed a number of deficiencies, namely, the lack of detailed disease and phenotypic descriptions for each genetic variant and of links to relevant patient organizations, that, if addressed, would allow LSDBs to better serve not just researchers but also the clinical genetics community, patients, their families, and related associations.

LSDBs available today predominantly document genetic variants pertinent to monogenic disorders. Our expectation is that as more data on genotype/phenotype correlations become available from genome-wide association studies, particularly for complex multifactorial disorders, it will become necessary to reorient the way pathogenic sequence variants are documented in many LSDBs. This may involve not only alterations in variant annotation, but could also necessitate a restructuring of existing DBMSs or even the design of novel database schemas. Overall, the development of the proposed federated network of LSDBs and CGDBs would bridge the divide between gene-centric and genome-wide approaches to databasing variation. Such a network would organize knowledge in a way that benefits both the research and clinical genetics communities and, ultimately, improves public health.

Acknowledgements

We thank Raymond Dalgleish for constructive comments on this manuscript and all GEN2PHEN partners for their feedback. The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 200754—the GEN2PHEN project.
