These authors contributed equally to this work.
Observ-OM and Observ-TAB: Universal syntax solutions for the integration, search, and exchange of phenotype and genotype information†
Article first published online: 4 APR 2012
© 2012 Wiley Periodicals, Inc.
Special Issue: Deep Phenotyping for Precision Medicine
Volume 33, Issue 5, pages 867–873, May 2012
How to Cite
Adamusiak, T., Parkinson, H., Muilu, J., Roos, E., van der Velde, K. J., Thorisson, G. A., Byrne, M., Pang, C., Gollapudi, S., Ferretti, V., Hillege, H., Brookes, A. J. and Swertz, M. A. (2012), Observ-OM and Observ-TAB: Universal syntax solutions for the integration, search, and exchange of phenotype and genotype information. Hum. Mutat., 33: 867–873. doi: 10.1002/humu.22070
For the Deep Phenotyping Special Issue
- Issue published online: 13 APR 2012
- Accepted manuscript online: 13 MAR 2012 12:31PM EST
- Manuscript Accepted: 22 FEB 2012
- Manuscript Received: 21 DEC 2011
- GEN2PHEN, BioSHaRE, and PANACEA (European Commission). Grant Number: FP7-HEALTH contracts 200754, 261433, and 222936
- The BBMRI-NL dynamic bioinformatics infrastructure rainbow project (BBMRI-NL RP-2)
- The BBMRI-FI (Academy of Finland/Biomedinfra), NBIC BioAssist/Biobanking, and the NWO (Rubicon Grant 825.09.008)
- data model
Genetic and epidemiological research increasingly employs large collections of phenotypic and molecular observation data from high-quality human and model organism samples. Standardization efforts have produced a few simple formats for exchange of these various data, but a lightweight and convenient data representation scheme for all data modalities does not exist, hindering successful data integration, such as assignment of mouse models to orphan diseases and phenotypic clustering for pathways. We report a unified system to integrate and compare observation data across experimental projects, disease databases, and clinical biobanks. The core object model (Observ-OM) comprises only four basic concepts to represent any kind of observation: Targets, Features, Protocols (and their Applications), and Values. An easy-to-use file format (Observ-TAB) employs Excel to represent individual and aggregate data in straightforward spreadsheets. The systems have been tested successfully on human biobank, genome-wide association study, quantitative trait locus, model organism, and patient registry data using the MOLGENIS platform to quickly set up custom data portals. Our system will dramatically lower the barrier for future data sharing and facilitate integrated search across panels and species. All models, formats, documentation, and software are freely available as open source (LGPLv3) at http://www.observ-om.org.
Modern biomedical research employs high-throughput technologies, large collections of biosamples, complex experimental and analysis protocols, time series and dependent annotations, and results. These elements constitute observations (“phenotypes” in the broadest sense of the word) and contextual information for these observations, all of which need to be shared appropriately as data and metadata. Human and model organism phenotypes (e.g., mouse knockouts) similarly need to be linked, not least for optimal clinical research. Accessing and integrating these phenotypic and molecular data from the thousands of local biobanks, project-specific databases, and broad public repositories in which they are held is hindered, not least, by differences in how the information is structured. Often, this entails reading free-text descriptions where no structured phenotypic representation exists. For example, complex phenotypic data from genome-wide association studies (GWAS) in resources such as the GWAS Central database [Thorisson et al., 2009] and the National Human Genome Research Institute catalogue [Hindorff et al., 2009] are initially collected by labor-intensive manual extraction from the literature by database curators, as the phenotypes are not published at source in a standard, exchangeable format. Similarly, the Biobanking and Molecular Resource Infrastructure (BBMRI) biobanking project consolidates 280 specialist European biobanks and other organizations that preserve millions of samples and associated annotations in diverse local schemas or formats [Yuille et al., 2008]. Other projects populate DNA variants and phenotypes into thousands of locus-specific databases (LSDBs) [Fokkema et al., 2011], and organize high-resolution phenotype findings in numerous model organism databases such as WormBase [Harris et al., 2010] and the Mouse Phenome Database [Grubb et al., 2009].
It is essential that integrated querying across such sets of phenotype information be made possible, and this will require the widespread use of powerful, standardized, user-friendly solutions regarding data syntax.
To this end, several EU and national organizations (EU-GEN2PHEN, EU-PANACEA, EU-BioSHaRE, BBMRI-NL, BBMRI.FI, EBI, P3G, NBIC, and LifeLines) joined forces to innovate a lightweight informatics system for storing and sharing observation data. We present this group's progress by first summarizing “Observ-OM”: a minimal model that defines and interrelates the conceptual subcomponents that are intrinsic to any and all kinds of “observation” (which within the realm of biomedicine would be called “phenotypes”). We then describe “Observ-TAB”: a simple and yet powerful spreadsheet format based upon Observ-OM that can be easily adopted by individual researchers, larger consortia, and biobanks without extensive software engineering. Finally, a range of real-world software applications are described, which demonstrate the practical utility of these two solutions.
Observ-OM: A General “Phenotype” Observation Model
Despite the vast complexity and diversity of possible biomedical observations, all such information can, in principle, be described via a straightforward set of four underlying concepts. These are specified by Observ-OM, an “object model” to describe the principal data elements and their relationships, which was drafted as a general solution in a series of interactive workshops organized by the GEN2PHEN project, involving experts from model organism and human research communities and employing diverse epidemiological, phenotypic, genetic, and molecular profiling strategies. In parallel, existing object models and tabular formats (TAB), such as MAGE-TAB [Rayner et al., 2006], FuGE-OM [Jones et al., 2007], eXtensible Genotype And Phenotype (XGAP)-OM [Swertz et al., 2010], PaGE-OM [Brookes et al., 2009], OBiBa (http://www.obiba.org), SAIL [Gostev et al., 2011], dbGaP [Mailman et al., 2007], and GenomeEUtwin [Litton et al., 2003] were carefully evaluated and used for guidance on validated successful design patterns and to ensure technical interoperability (see Table 1).
|Term|Description|Website|
|---|---|---|
|API|Application programming interface|–|
|BBMRI|EU ESFRI project integrating biobank data across countries|www.bbmri.eu|
|BBMRI.FI|Finnish BBMRI hub|www.bbmri.fi/en|
|BBMRI-NL|The Netherlands BBMRI hub|www.bbmri.nl|
|BioSHaRE|EU biobank data integration project|www.bioshare.eu|
|DataSHaPER|Biobank harmonization platform of Public Population Project in Genomics (P3G)|www.datashaper.org|
|GEN2PHEN|EU project integrating genotype and phenotype data|www.gen2phen.org|
|GUI|Graphical user interface|–|
|GWAS Central|A centralized compilation of summary level findings from genetic association studies|www.gwascentral.org/|
|HL7|Health Level Seven|www.hl7.org|
|MAGE-TAB|A tab-delimited format for representing functional genomics data|www.mged.org/mage-tab|
|MOLGENIS|Software generating infrastructure (databases, APIs, GUIs) for life science projects|www.molgenis.org|
|OBiBa|A suite of open source software for biobanks from P3G|www.obiba.org|
|Object Model|An abstract representation of a domain's concepts, data, and the relationships between these, used to design or generate software.| |
|P3G|Public Population Project in Genomics (P3G) is an international consortium for population genomics.|www.p3g.org|
|PaGE-OM|An object model for Genotype and Phenotype data|www.pageom.org|
|PANACEA|Quantitative pathway analysis of natural variation in complex disease signalling in C. elegans| |
|Observ-TAB|A tab-delimited format derived from Observ-OM used to represent and exchange phenotype data|www.observ-om.org|
|XGAP|XGAP is an open and flexible object model for xQTL, GWL, GWA and mutagenesis data|www.xgap.org|
A universal way to conceptualize phenotypic and other observations came out of this work, entailing just four core concepts: an observation Target, an observable Feature, an observation Protocol (and its Application), and the observed Value (see Fig. 1). The observation Target is the entity to which the phenotype information relates. It can accommodate any kind of object or concept, such as a panel of case/control samples as in the Framingham study in dbGaP [Cupples et al., 2007], a whole mouse strain as in Europhenome [Morgan et al., 2010], or an individual specimen as in MPD [Grubb et al., 2009]. The observable Feature is the character or “question” under consideration (see below for further details on how this question is structured). The observation Protocol describes the method or procedure used to assess the Feature, whether this is a clinical, laboratory, computational, inferential, or statistical process, along with its actual instance of use in Protocol Application. Finally, the observed Value defines the observation itself. For example, an observed Value of “140.1” could be reported for a particular Target = “Total consent group” for Feature = “BLOOD PRESSURE, SYSTOLIC, FIRST READING TAKEN BY PHYSICIAN” using a particular examination Protocol = “FRAMINGHAM Clinic Exam, Original Cohort, Exam 10.” Figure 1 gives further examples of the use of this top-level phenotype model in practice.
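The four concepts can be illustrated with a minimal Python sketch. This is not the Observ-OM reference implementation (which is declared in MOLGENIS XML); the class layout, the attribute names, and the identifier `exam10-application` are assumptions made for illustration, whereas the Target, Feature, Protocol, and Value strings come from the Framingham example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    name: str


@dataclass(frozen=True)
class Feature:
    name: str


@dataclass(frozen=True)
class Protocol:
    name: str


@dataclass(frozen=True)
class ProtocolApplication:
    name: str  # hypothetical identifier for one use of the Protocol
    protocol: Protocol


@dataclass(frozen=True)
class ObservedValue:
    target: Target
    feature: Feature
    protocol_application: ProtocolApplication
    value: str


def describe(observation: ObservedValue) -> str:
    """Render one observation as a readable sentence."""
    return (
        f"{observation.feature.name} of {observation.target.name} = "
        f"{observation.value} "
        f"(protocol: {observation.protocol_application.protocol.name})"
    )


panel = Target("Total consent group")
systolic = Feature("BLOOD PRESSURE, SYSTOLIC, FIRST READING TAKEN BY PHYSICIAN")
exam = Protocol("FRAMINGHAM Clinic Exam, Original Cohort, Exam 10")
application = ProtocolApplication("exam10-application", exam)
observation = ObservedValue(panel, systolic, application, "140.1")

print(describe(observation))
```

Keeping the Value as a string mirrors the spreadsheet setting, where the interpretation of a cell is left to the Feature definition rather than to the cell itself.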
A more detailed description of Observ-OM is provided in Supp. Materials and at the project Website (http://www.observ-om.org). Figure 2 elaborates the main elements in the model, specifies the key relationship roles between them by naming these connections, and highlights the importance of suitable ontologies in naming model elements, their attributes, and the data themselves. It also shows circular reflexive relations on all main elements to indicate that each such concept can be related to itself. For example, the Target part of a phenotype record could comprise a group of several different Targets, such as samples of four limbs from an individual mouse being evaluated for average degree of stunted growth.
Because the core elements are fairly abstract, additional flavors of Feature and Target can be added via convenient “subclasses” (marked by a triangle in Fig. 2, following object-oriented modeling conventions). Subclasses are identical to the core elements but with additional attributes, such as Panel (is a Target) having attribute “individuals” to group individuals as a cohort, Sample (is a Target) having attribute “source” and “tissue,” and Individual (is a Target) having attributes on “father” and “mother” relationships. Note that subclasses are for convenience only; their information can also be reported using core elements ensuring compatibility with the core model, for example, by using a Protocol that observes a Feature named “father” and a Feature named “mother” instead. Subclasses can be used in place of their “superclass,” that is, Individual and Panel can be a target for an Observed Value.
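The equivalence between a convenience subclass and the core elements can be sketched as follows. The Python classes and the helper `to_core_values` are hypothetical and serve only to show that the “father” and “mother” attributes of an Individual can be rewritten as Feature/Value pairs; the animal and panel names are invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Target:
    name: str


@dataclass
class Individual(Target):
    # Convenience attributes added by the subclass.
    father: Optional[str] = None
    mother: Optional[str] = None


@dataclass
class Panel(Target):
    # Groups individuals, for example as a cohort.
    individuals: List[str] = field(default_factory=list)


def to_core_values(individual: Individual) -> List[Tuple[str, str, str]]:
    """Rewrite subclass attributes as (target, feature, value) triples."""
    triples = []
    for feature in ("father", "mother"):
        value = getattr(individual, feature)
        if value is not None:
            triples.append((individual.name, feature, value))
    return triples


pup = Individual("mouse3", father="mouse1", mother="mouse2")
litter = Panel("litter1", individuals=["mouse3"])
core = to_core_values(pup)
print(core)
```

Because `Individual` and `Panel` inherit from `Target`, either can be used wherever a Target is expected, which is the substitution rule stated above.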
When using Observ-OM, it is particularly important to have a good understanding of how the Feature class is intended to operate. Specifically, Features are essentially questions about phenotypic or other attributes or traits that can be observed or derived by protocol, and these are split into Characteristics for situations where the objective is simply to record presence or absence (i.e., possible Values are yes/no/unknown) and Measurements if multiple answers (Values) are allowed that may or may not be constrained to a limited set or range of options. In case of a limited set, discrete possible Measurement values are termed a Permitted Value, represented by Characteristic. To illustrate this, Figure 2 gives the example of a specific Characteristic subclass called Variant, that is, a known DNA variant for a Marker at a certain bp position on a particular Chromosome, wherein Reference and Alternative alleles are known and for which the Feature question is therefore “sequence variant observed at this genome location” yes/no/unknown. Other examples we have validated in the contexts of quantitative trait loci (QTL), GWAS, LSDB, and next-generation sequencing (NGS) datasets are listed below as part of the description of various real-world applications.
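The distinction between Characteristics and Measurements can be sketched as a simple value check. The method name `accepts`, the attribute `permitted_values`, and the example Measurement “sex” with its two options are assumptions for illustration and are not taken from the model definition.

```python
from dataclasses import dataclass, field
from typing import List

YES_NO_UNKNOWN = {"yes", "no", "unknown"}


@dataclass
class Feature:
    name: str


@dataclass
class Characteristic(Feature):
    """A presence/absence question; answers are yes, no, or unknown."""

    def accepts(self, value: str) -> bool:
        return value.lower() in YES_NO_UNKNOWN


@dataclass
class Measurement(Feature):
    """A question with free answers or answers from a limited set."""

    permitted_values: List[Characteristic] = field(default_factory=list)

    def accepts(self, value: str) -> bool:
        if not self.permitted_values:  # unconstrained Measurement
            return True
        return value in {option.name for option in self.permitted_values}


variant = Characteristic("sequence variant observed at this genome location")
sex = Measurement("sex", [Characteristic("male"), Characteristic("female")])
pressure = Measurement("Rpressure")

print(variant.accepts("Yes"), sex.accepts("female"), sex.accepts("140.1"))
```

Each permitted option is itself a Characteristic, following the statement that a Permitted Value is represented by Characteristic.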
Observ-TAB: A Simple Spreadsheet for Observation Data
To provide a direct practical means for sharing observation data without the need for complex informatics support, a tabular exchange format was developed (Observ-TAB) using the model concepts described above. Many researchers already use Excel spreadsheets to share “data matrices” with the results of their observation protocols, and Observ-TAB provides an exact fit into this way of working. As illustrated in Figure 3d, in the Observ-TAB format, each data row represents a Target (e.g., the “individual” subjects) and each data column represents a Feature (e.g., “[what is] individual's right ear pressure in daPa”), and each cell is a Value (e.g., “−108.0 daPa”). A particular research investigation, in this case, could comprise many Excel sheets (datasets) to record, for example, genotypes for many markers for a series of cohort members on one sheet, their phenotypes on another sheet, their individual case reports on yet another sheet, and resulting GWAS association P values for multiple markers against different subphenotypes on a final sheet (each datasheet relating to the protocol that was used).
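The row/column/cell reading of a data sheet can be sketched in a few lines of Python. The sketch assumes a tab-delimited export whose first column names the Target; the header label `target`, the column `Lpressure`, the individual names, and all cell values other than “-108.0” are invented for illustration.

```python
import csv
import io

SHEET = (
    "target\tRpressure\tLpressure\n"
    "ind1\t-108.0\t-95.5\n"
    "ind2\t-12.0\t\n"  # empty cell: nothing was observed
)


def sheet_to_triples(text: str) -> list:
    """Turn a Target-by-Feature matrix into (target, feature, value) triples."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    target_column = reader.fieldnames[0]
    triples = []
    for row in reader:
        for feature in reader.fieldnames[1:]:
            value = row[feature]
            if value != "":  # skip empty cells
                triples.append((row[target_column], feature, value))
    return triples


triples = sheet_to_triples(SHEET)
for triple in triples:
    print(triple)
```

Each triple corresponds to one observed Value; the Protocol is implied by the sheet in which the matrix is stored.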
Observ-TAB also readily enables the sharing of metadata on both Targets and Features, and this is a common use case in genetics. It entails simply adding a new Excel sheet every time one wishes to annotate the properties of any of the data (sub)classes in Observ-OM. For example, in Figure 3b and c, the sheet “Individual” lists all the individuals used as Target elements, and sheet “Measurement” lists all the phenotype assessments used as Feature elements (Measurement is a subclass of Feature, used in many of our real-world applications). In this format, the first column defines the unique name of each Target or Feature, and additional columns describe its properties. For example, the first Measurement entry has name = “Rpressure” in the first column, followed by properties description = “middle ear pressure at maximum compliance, right ear,” unit name = “daPa,” and dataType = “string.”
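As a sketch of how such a metadata sheet might be consumed, the following Python reads a tab-delimited “Measurement” sheet into a lookup keyed by the first column and reports data columns that lack a definition. The column label `unit_name` and both function names are assumptions; the “Rpressure” entry is taken from the example above.

```python
import csv
import io

MEASUREMENT_SHEET = (
    "name\tdescription\tunit_name\tdataType\n"
    "Rpressure\tmiddle ear pressure at maximum compliance, right ear"
    "\tdaPa\tstring\n"
)


def load_metadata(text: str) -> dict:
    """Index a metadata sheet by its first column (the unique name)."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    key = reader.fieldnames[0]
    return {
        row[key]: {column: cell for column, cell in row.items() if column != key}
        for row in reader
    }


def undefined_columns(data_header: list, measurements: dict) -> list:
    """List data-sheet Feature columns that have no Measurement entry."""
    return [column for column in data_header[1:] if column not in measurements]


measurements = load_metadata(MEASUREMENT_SHEET)
print(measurements["Rpressure"]["unit_name"])
print(undefined_columns(["target", "Rpressure", "Lpressure"], measurements))
```

A check of this kind lets a data sheet be validated against its annotation sheets before the two are shared together.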
The use of Observ-TAB for metadata recording and sharing also extends to Protocols and Protocol Applications. This is important because precise reporting of which Protocol was used, and exactly how it was used, is critical for proper interpretation of complex phenotypic measurements, which depend on specific equipment or assay parameters that may vary between studies or even batches of observations. To capture the necessary information, the Protocol sheet should include columns to specify how the procedure is applied plus one column to list the Features (Measurements) that will be generated by the Protocol. Naturally, there could be additional Sub-Protocol sheets to give method details for the testing of each particular Feature. As described before, all Protocol Applications are reported in sheets having the same name as the protocol (Fig. 3d). Optionally, additional columns can be added to record meta information, such as date and time relevant to the particular occasion that the Protocol was applied to each test individual (Target). Alternatively, all Protocol Applications and Observed Values can be reported in separate sheets instead (Fig. 3e). Via the unique link from Value to ProtocolApplication, each Protocol Application acts to functionally group primary observations (Values) observed by each single application of a Protocol. This system works equally well for manual assessments (e.g., questionnaires), wet laboratory activities (e.g., sample library preparations), and computational protocols (e.g., “statistical association analysis”).
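The grouping role of a Protocol Application in the separate-sheet layout can be sketched as follows. The column names, the protocol name “Tympanometry”, the application identifiers, the time stamps, and the values are all hypothetical; only the link from each Value to exactly one Protocol Application reflects the description above.

```python
from collections import defaultdict

protocol_applications = [
    {"name": "pa1", "protocol": "Tympanometry", "time": "2011-05-02T09:30"},
    {"name": "pa2", "protocol": "Tympanometry", "time": "2011-05-03T14:00"},
]

observed_values = [
    {"protocolApplication": "pa1", "target": "ind1", "feature": "Rpressure", "value": "-108.0"},
    {"protocolApplication": "pa1", "target": "ind1", "feature": "Lpressure", "value": "-95.5"},
    {"protocolApplication": "pa2", "target": "ind2", "feature": "Rpressure", "value": "-12.0"},
]


def group_by_application(applications: list, values: list) -> dict:
    """Group Values by the single Protocol Application each one refers to."""
    known = {application["name"] for application in applications}
    grouped = defaultdict(list)
    for row in values:
        reference = row["protocolApplication"]
        if reference not in known:
            raise ValueError(f"unknown protocol application: {reference}")
        grouped[reference].append((row["target"], row["feature"], row["value"]))
    return dict(grouped)


grouped = group_by_application(protocol_applications, observed_values)
for name, rows in grouped.items():
    print(name, rows)
```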
Where existing ontologies are available, our Observ-OM/TAB system promotes their use by allowing specific references to ontology terms. For example, elevated systolic blood pressure in mammals can be annotated with the MP:0006144 term defined in Mammalian Phenotype Ontology [Smith et al., 2005] as “abnormal increase in the pressure in the arteries as the heart contracts and pumps blood into the arteries.” Use of ontologies enhances searching by traversal of subtypes of disease, and also by use of the synonyms they contain. Use of ontology references promotes integration by ensuring that potentially ambiguous terms are linked to formal definitions. Valid Observ-TAB files are encoded as Excel or as a directory of tab-delimited/comma-separated values files (where each file represents one data sheet).
Evaluation in Real-World Applications
To date, we have tested the Observ-OM and Observ-TAB system in more than 20 applications ranging from small-scale laboratory observations to high-throughput NGS and GWAS/GWL data from multiple species. In this section, we summarize some of these “apps” to highlight how the model and the format were used, and what data display or processing software were produced. All supporting software was built on the MOLGENIS biosoftware platform, which eases software sharing and reuse [Swertz and Jansen, 2007; Swertz et al., 2010]. A key feature of MOLGENIS is that it can read an object model and then autogenerate a matching database, complete with user interfaces and software interfaces to load Excel and CSV files, R statistics, and Web services. It is thus ideal for our needs, as we wanted to rapidly create subclasses of Features and Targets to easily apply the Observ-OM in each particular domain of use. In MOLGENIS, the core and subclass models are declared in human readable XML files. Full documentation on MOLGENIS model and generator is available at www.molgenis.org, and downloads of the model XML and software currently developed with Observ-OM are available in www.molgenis.org/svn/molgenis_apps/trunk.
App 1: Integration of Public Phenotype Observations from Mouse and Man
This use case was to collect publicly available observation data from human and mouse in dbGaP (www.ncbi.nlm.nih.gov/gap) [Mailman et al., 2007], Europhenome (www.europhenome.org) [Mallon et al., 2008; Morgan et al., 2010], and the Mouse Phenome Database (www.jax.org/phenome) [Grubb et al., 2009], and it served as a “reference implementation” to extensively test Observ-OM within the GEN2PHEN project. For each source, it was straightforward to create a data converter from the source to Observ-TAB format, using the MOLGENIS “CsvReader/Writer” toolbox. For example, in the case of the Framingham study, data are provided in dbGaP–XML format, which is difficult for humans to read. A simple converter from dbGaP–XML to Observ-TAB Excel greatly improved tractability and readability of the data in the loading phase. This first App used the core model (see Fig. 2b): Measurement to define Features having attributes name, description, unit, and dataType; Permitted Value to define categorical value options that are attached to a Measurement; Individual to use as Targets for individual-level observations; and Panel to describe a set of Individuals and to use as Targets for human-cohort/mouse-strain level observations. Example data can be queried at http://www.ebi.ac.uk/microarray-srv/pheno.
App 2: “xQTL” Research Portal for High-Throughput Observations
The developers of the XGAP system incorporated Observ-OM in their “xQTL workbench” [Arends et al., 2012]: a scalable Web system for the mapping of QTLs in mouse, worm, and plant model populations. Using Observ-OM, they can import/export, query, and analyze observation datasets and QTL profiles on gene expression (eQTL), protein abundance (pQTL), metabolite abundance (mQTL), and phenotype (phQTL) levels. The system is used by the EU-PANACEA project for multiomics in Caenorhabditis elegans and by the LifeLines human biobank for phenotypes and genotypes (65,000+ samples), in both cases to provide researchers with an integrated research portal for all their epidemiological, GWAS, and QTL analyses (e.g., replacing R/qtl with PLINK but using the same core model between them).
The xQTL App added approximately 50 convenient subclasses to Observ-OM such as Marker, SNP, MicroarryProbe, MassPeak, Metabolite, NMR, Clone, Factor, Gene, Tissue, Analysis and DataSet, to name a few. Also, software modules were developed for Observ-OM data including Excel upload wizard, Matrix data browser, and Interface to the R language. To improve scalability when storing millions of gene expression, genotype, and QTL observations, xQTL implemented a binary format for data matrices as well as “drivers” to directly load the values from existing genome variation file formats such as PLINK and VCF. Example data sets can be queried at http://www.xqtl.org.
App 3: “DEB-Central” for Dystrophic Epidermolysis Bullosa Phenotypes and Genotypes
Dystrophic Epidermolysis Bullosa (DEB) is an inherited blistering disorder caused by alterations in the COL7A1 gene and covers a group of several distinctive phenotypes. Although general genotype–phenotype correlation rules have emerged, many exceptions to these rules exist, hindering diagnosis and genetic counseling. Therefore, an international DEB registry was set up to elucidate these cases by collecting a wide variety of DEB genotype–phenotype observations [Van Den Akker et al., 2011].
The DEB-Central App uses Observ-OM to enable researchers to enter new phenotypes by simply adding new Protocols (e.g., “Cutaneous”) for particular Measurements (e.g., hands, feet, arms, proximal body flexures) without the need for any programming. Before we established Observ-OM, all phenotypes were stored using traditional database tables having one column per phenotype. This was a major drawback in that the addition of a new phenotype required a programmer to add a new column to these tables, which was overcome using Observ-OM. We also added convenience subclasses to define genetic variants: MutationGene to store the locus reference genomic [Dalgleish et al., 2010] nucleotide sequence of the gene (LRG_286 at http://www.lrg-sequence.org); Exon to define the intron and exon boundaries; ProteinDomain to define domains on the gene; and Mutation to define cDNA, gDNA, and AA sequence alterations in HGVS notation. Example data can be queried at http://www.deb-central.org.
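The difference from a one-column-per-phenotype design can be sketched with an in-memory SQLite database. The table and column names, the patient identifier, and the recorded values are hypothetical and do not reproduce the DEB-Central schema; the Protocol “Cutaneous” and the Measurements “hands,” “feet,” and “arms” are taken from the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE measurement (
        name     TEXT PRIMARY KEY,
        protocol TEXT NOT NULL
    );
    CREATE TABLE observed_value (
        target      TEXT NOT NULL,
        measurement TEXT NOT NULL REFERENCES measurement(name),
        value       TEXT NOT NULL
    );
    """
)


def add_measurement(name: str, protocol: str) -> None:
    """Register a new phenotype as a data row; the schema is untouched."""
    conn.execute(
        "INSERT INTO measurement (name, protocol) VALUES (?, ?)", (name, protocol)
    )


def record(target: str, measurement: str, value: str) -> None:
    conn.execute(
        "INSERT INTO observed_value (target, measurement, value) VALUES (?, ?, ?)",
        (target, measurement, value),
    )


def columns(table: str) -> list:
    """Return the column names of a table (second field of table_info)."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


columns_before = columns("observed_value")
for name in ("hands", "feet"):
    add_measurement(name, "Cutaneous")
record("patient1", "hands", "yes")  # hypothetical patient and value

# Later, a new phenotype is added without any change to the tables.
add_measurement("arms", "Cutaneous")
record("patient1", "arms", "no")
columns_after = columns("observed_value")

print(columns_before == columns_after)
```

In the earlier one-column-per-phenotype layout, the last step would have required an `ALTER TABLE` statement written by a programmer.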
App 4: “AnimalDB” to Track and Trace Animal Life Events in Research Laboratories
The Life Sciences Department of Groningen University developed AnimalDB to keep track of animals, locations, breeding, and experiments in and across departments. At the end of each year, the system needs to automatically generate annual reports on animal use, as required by the government. Animals are stored as Individuals, and Panel is used whenever animals need to be grouped, mostly in the context of breeding (e.g., parent groups and litters). All the animal properties (such as sex, species, genotype) are stored as observed Values. The same holds true for all experimental measurements performed upon the animals. The timestamp attribute of Value is used for experimental measurements that can be assessed multiple times. On each such occasion, a new unique Value is created with a distinct timestamp, and it belongs to a separate Protocol Application describing the observation event. Example data can be queried at http://www.animaldb.org.
App 5: “SequenceFlow” for Laboratory Work and Analyses in NGS
The “Genome of the Netherlands” (GoNL) is an ambitious NGS project that aims to sequence the genomes of 770 Dutch individuals within BBMRI-NL (http://www.nlgenome.nl). The SequenceFlow App was developed to track the hundreds of Illumina sequence lanes (Targets) and dozens of analysis Protocols, and to schedule and run thousands of analysis steps taking hundreds of thousands of computer hours. The SequenceFlow App added convenient subclasses to track NGS-specific targets: NgsSample describing DNA material; Flowcell for the Illumina chips; and LibraryLane to define libraries, their application to flowcell lanes, and optionally tracking of barcodes. Furthermore, subclasses of Protocol were added to define composite protocols: Workflow to define sequences of steps; and WorkflowElement to define for each step what Protocol should be used. Finally, elements to capture computational workflows were added, such as ComputeProtocol to define the analysis scripts for sequence alignment and SNP calling that consist of 24 steps using widely accepted NGS tools [Depristo et al., 2011]. All parameters involved (such as “reference genome”) as well as QC measurements produced (such as “number of reads aligned”) can be reported as observed Values. Many software components were added to work with this model, most notably an automated execution system to run all analysis jobs on a PBS computer cluster. Example data can be queried at http://www.molgenis.org/wiki/SequenceFlow.
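How Workflow, WorkflowElement, and ComputeProtocol might fit together can be sketched as follows. The attribute names, the script strings, the two step names, and the use of Python's `graphlib` (Python 3.9 or later) to derive an execution order are assumptions for illustration; the actual SequenceFlow pipeline comprises 24 steps and its scheduler is not shown here.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from typing import List


@dataclass(frozen=True)
class ComputeProtocol:
    name: str
    script: str  # hypothetical command template


@dataclass
class WorkflowElement:
    name: str
    protocol: ComputeProtocol
    previous_steps: List[str] = field(default_factory=list)


@dataclass
class Workflow:
    name: str
    elements: List[WorkflowElement]

    def execution_order(self) -> List[str]:
        """Order the steps so that each runs after its predecessors."""
        graph = {
            element.name: set(element.previous_steps) for element in self.elements
        }
        return list(TopologicalSorter(graph).static_order())


alignment = ComputeProtocol("alignment", "align.sh ${lane}")
snp_calling = ComputeProtocol("snp_calling", "call_snps.sh ${lane}")

workflow = Workflow(
    "demo",
    [
        WorkflowElement("step2_call", snp_calling, previous_steps=["step1_align"]),
        WorkflowElement("step1_align", alignment),
    ],
)

order = workflow.execution_order()
print(order)
```

Separating the step (WorkflowElement) from the procedure it runs (Protocol) allows one Protocol to be reused at several points in a Workflow.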
Discussion
The purpose of databasing observation information is to join many datasets together and do intelligent reasoning across them. Observ-OM captures the core information needed for describing scientific observations, and provides a common language that can be used to harmonize representation and supporting software implementations. This makes data integration easier. For example, a whole-population study of blood pressure may gather data from many centers; to determine whether those datasets can be integrated, and then to integrate them, they need to be pooled into one system. Observ-OM makes this obvious without humans having to second-guess “what the column headers mean” and “whether the centers used equivalent protocols and measurement units.”
Observ-OM and Observ-TAB have proven their ability to capture many different modalities of phenotype and genotype observations, ranging from large-scale molecular profiling in model organisms to deep characterization in manually curated patient databases. To use Observ-OM optimally in these situations, a powerful symmetry in the nature of the Target and the Feature classes needs to be understood, namely, that certain types of Features can be (and often need to be) used as exact equivalents of Targets. For example, in one record, a “T” variant might be a Permitted Value for a particular polymorphic DNA sequence Feature (marker “rs429358” for example), but in another record, that same “T” allelic variant might be employed as a Target in an in silico assessment of that variant's likely pathogenicity. This dual role is conveniently implemented by having a Characteristic with name “rs429358 = T” and using this both as a Feature with options “Yes/No/Unknown” (i.e., question “is rs429358 = T”?) and as a Target simply by always assuming the “Yes” value in that latter role.
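The dual role can be sketched with plain records. The Characteristic name “rs429358 = T” comes from the example above; the individuals, the Feature “predicted pathogenicity,” its placeholder value, and the two helper functions are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ObservedValue:
    target: str
    feature: str
    value: str


ALLELE = "rs429358 = T"  # Characteristic name taken from the text

values = [
    # Role 1: the Characteristic is the Feature ("is rs429358 = T?").
    ObservedValue(target="individual1", feature=ALLELE, value="Yes"),
    ObservedValue(target="individual2", feature=ALLELE, value="No"),
    # Role 2: the same Characteristic is the Target; "Yes" is implied.
    ObservedValue(target=ALLELE, feature="predicted pathogenicity", value="unknown"),
]


def carriers(records: List[ObservedValue], characteristic: str) -> List[str]:
    """Targets for which the Characteristic was observed as present."""
    return [
        record.target
        for record in records
        if record.feature == characteristic and record.value == "Yes"
    ]


def annotations(records: List[ObservedValue], characteristic: str) -> Dict[str, str]:
    """Observations made about the Characteristic itself."""
    return {
        record.feature: record.value
        for record in records
        if record.target == characteristic
    }


print(carriers(values, ALLELE))
print(annotations(values, ALLELE))
```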
Considering how Observ-OM might be similar to and compatible with other systems, at the most basic level, we recognize that our system's Target–Feature–Value foundation mirrors the well-known “Entity–Attribute–Value” database table structure often used, for example, in biobanking databases [Dinu and Nadkarni, 2006]. As compared with regular relational database tables, this approach gains great flexibility (users can add new columns/features without additional programming) at the cost of relative difficulty when working with large data volumes (the database must work harder to bring data back into tabular view), which is overcome with optimized storage and indexing technologies. Furthermore, this basic “trio” arrangement brings direct compatibility with the subject–predicate–object triples that are at the core of “semantic Web” and “nanopublication” approaches [Groth et al., 2010]. This broad convergence to a very simple basal structure for stating the properties of an entity reflects the most elegant and effective approach to managing highly complex and diverse information.
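The extra work needed to bring Target–Feature–Value data back into a tabular view can be made concrete with a small pivot function. The sketch uses plain Python lists rather than a database, and the function name, the header label `target`, and the sample triples are assumptions for illustration.

```python
def pivot(triples: list) -> list:
    """Rebuild a Target-by-Feature table from (target, feature, value) triples."""
    features = []  # column order follows first appearance
    rows = {}      # target -> {feature: value}
    for target, feature, value in triples:
        if feature not in features:
            features.append(feature)
        rows.setdefault(target, {})[feature] = value
    table = [["target"] + features]
    for target, cells in rows.items():
        table.append([target] + [cells.get(feature, "") for feature in features])
    return table


triples = [
    ("ind1", "Rpressure", "-108.0"),
    ("ind1", "Lpressure", "-95.5"),
    ("ind2", "Rpressure", "-12.0"),
]

table = pivot(triples)
for line in table:
    print("\t".join(line))
```

Every triple already has the subject–predicate–object shape, so the same list could be serialized as semantic Web statements without restructuring.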
On a more domain-specific level, Observ-OM was built by close reference to the progress made by other biomedical models: MAGE-TAB, FuGE-OM, XGAP-OM, PaGE-OM, OBiBa, SAIL, and GenomeEUTwin; the natural convergence points to these are shown in Supp. Table S1. For example, our system allows integration with analysis tools directly, so some automated export from DataSHaPER or SAIL of core phenotype metadata would allow users to prepopulate their Observ-OM instances and then request the underlying data. This would clearly need to be moderated to facilitate ethical approval, but would limit the need for reloading the primary data from external instances each time a change is made in the source database.
There are also some parallels between our approach and the HL7 messaging format for clinical observation and laboratory results (www.hl7.org). Within each record, one field (OBX-3) identifies the feature being tested, and another (OBX-5) carries the observation value. In this regard, Feature and Value map to OBX-3 and OBX-5, respectively, and the model presented here is potentially HL7 compatible while also supporting animal studies. The LifeLines project is actively exploring HL7 for their core database and therefore preliminary HL7 to Observ-OM mappings have been made to present the complex HL7 data in a more researcher-friendly format. Additionally, we find support for our modeling strategy in prior work in the field [Fowler, 1997], and we urge modeling experts in particular to look at those developments where they have used similar concepts such as “Observables” or “Characteristics.”
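The Feature/Value mapping can be sketched for a single pipe-delimited HL7 version 2 OBX segment. This is not the LifeLines mapping; the function name and the example segment are invented, the use of OBX-6 for units goes beyond the two fields discussed above, and HL7 escape sequences and repetitions are ignored.

```python
def obx_to_observation(segment: str) -> dict:
    """Map one HL7 v2 OBX segment onto Feature and Value."""
    fields = segment.split("|")
    if fields[0] != "OBX":
        raise ValueError("not an OBX segment")
    # fields[0] holds the segment name, so fields[n] is OBX-n.
    identifier = fields[3].split("^")  # code^text^coding system
    has_text = len(identifier) > 1 and identifier[1] != ""
    return {
        "feature_code": identifier[0],
        "feature": identifier[1] if has_text else identifier[0],
        "value": fields[5],
        "unit": fields[6] if len(fields) > 6 else "",
    }


# Illustrative segment, not taken from any real message.
SEGMENT = "OBX|1|NM|8480-6^Systolic blood pressure^LN||140.1|mm[Hg]|||||F"

observation = obx_to_observation(SEGMENT)
print(observation)
```

The Target is not part of the OBX segment; in an HL7 message it would come from the patient identification carried elsewhere in the message.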
Some would argue that XML-based formats are more appropriate for data exchange than tabular formats. However, as demonstrated and explained by the MAGE-TAB project [Rayner et al., 2006], this is not universally true: XML is most appropriate for representing tree-like structures, whereas protocol application graphs are directed acyclic graphs. More importantly, editing XML files is a lot to ask of researchers, whereas documents in tabular format can be created, viewed, and edited in essentially any spreadsheet software (e.g., Microsoft Excel), which is typically familiar to biologists, who commonly use spreadsheets to maintain notes and track data.
Discussion is also warranted on the matter of quality control (QC), especially for the archiving of datasets. The format itself provides basic QC in that it can check whether the data (Observed Value) follow the definitions of Measurement and Protocol. In addition, the ability of our systems to export data to R and other analysis applications allows QC processes to be run repeatedly and automatically as data are added, removed, or edited. This is particularly important when working with long-term experiments (e.g., epidemiology or mouse phenotyping) where factors such as baseline drift and instrument calibration errors may need to be detected and modeled continuously. It also allows researchers to access partial data as it appears and limits the need for biologists to develop these pipelines for each new test. This information is represented in our Protocol/Protocol Application package, and the archiving of detailed protocols allows iterative quality analysis/control over experimentally induced error as data are collected, rather than at some arbitrary end point where corrections may no longer be possible.
Future work for our project includes incorporation of minimal metadata standards [Kettner et al., 2010] in the form of a “minimal set of observation protocols and measurements” which define the content of what is stored. As these emerge, it is simple to implement checks for these and to report missing or erroneous data (e.g., inconsistencies in referencing ontologies). As our system is already implemented and validated in many projects at different locations, some support for federated queries (e.g., using the available REST Web services or R object interfaces) would allow users to expose all or parts of their data for external query and analysis. This would further add to our system's proven ability to enable the access, sharing, integration, and archiving of complex phenotypic data at different levels of granularity and availability.
We thank CASIMIR, Europhenome, dbGaP, MolPage, MPD, OBiBa, GenomeEUTwin, String of Pearls Initiative (PSI), and LifeLines for kindly sharing their metadata to test and validate our developments, and P3G for organizing the data model harmonization session in Luxembourg, 2009.
- 2012. xQTL workbench: a scalable web environment for multi-level QTL analysis. Bioinformatics [Epub ahead of print].
- 2009. The phenotype and genotype experiment object model (PaGE-OM): a robust data structure for information related to DNA variation. Hum Mutat 30:968–977.
- 2007. The Framingham Heart Study 100K SNP genome-wide association study resource: overview of 17 phenotype working group reports. BMC Med Genet 8:S1.
- 2010. Locus Reference Genomic sequences: an improved basis for describing human DNA variants. Genome Med 2:24.
- 2011. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 43:491–498.
- 2007. Guidelines for the effective use of entity-attribute-value modeling for biomedical databases. Int J Med Inform 76:769–779.
- 2011. LOVD v.2.0: the next generation in gene variant databases. Hum Mutat 32:557–563.
- 1997. Analysis patterns: reusable object models. Vol. 10. Boston, MA: Addison-Wesley; pp. 357.
- 2011. SAIL—a software system for sample and phenotype availability across biobanks and cohorts. Bioinformatics 27:589–591.
- 2010. The anatomy of a nanopublication. Information Services and Use 30:51–56.
- 2009. Mouse phenome database. Nucleic Acids Res 37:D720–D730.
- 2010. WormBase: a comprehensive resource for nematode research. Nucleic Acids Res 38:D463–D467.
- 2009. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci USA 106:9362–9367.
- 2007. The Functional Genomics Experiment model (FuGE): an extensible framework for standards in functional genomics. Nat Biotechnol 25:1127–1133.
- 2010. Meeting Report from the Second “Minimum Information for Biological and Biomedical Investigations” (MIBBI) workshop. Stand Genomic Sci 3:259–266.
- 2003. Data modeling and data communication in GenomEUtwin. Twin Res 6:383–390.
- 2007. The NCBI dbGaP database of genotypes and phenotypes. Nat Genet 39:1181–1186.
- 2008. EuroPhenome and EMPReSS: online mouse phenotyping resource. Nucleic Acids Res 36:D715–D718.
- 2010. EuroPhenome: a repository for high-throughput mouse phenotyping data. Nucleic Acids Res 38:D577–D585.
- 2006. A simple spreadsheet-based, MIAME-supportive format for microarray data: MAGE-TAB. BMC Bioinformatics 7:489.
- 2005. The Mammalian Phenotype Ontology as a tool for annotating, analyzing and comparing phenotypic information. Genome Biol 6:R7.
- 2007. Beyond standardization: dynamic software infrastructures for systems biology. Nat Rev Genet 8:235–243.
- 2010. XGAP: a uniform and extensible data model and software platform for genotype and phenotype experiments. Genome Biol 11:R27.
- 2009. HGVbaseG2P: a central genetic association database. Nucleic Acids Res 37:D797–D802.
- 2011. The international dystrophic epidermolysis bullosa patient registry: an online database of dystrophic epidermolysis bullosa patients and their COL7A1 mutations. Hum Mutat 32:1100–1107.
- 2008. Biobanking for Europe. Brief Bioinf 9:14–24.
Additional supporting information may be found in the online version of this article.