Modern biomedical research employs high-throughput technologies, large collections of biosamples, complex experimental and analysis protocols, time series and dependent annotations, and results. These elements constitute observations (“phenotypes” in the broadest sense of the word) and contextual information for these observations, all of which need to be shared appropriately as data and metadata. Human and model organism phenotypes (e.g., mouse knockouts) similarly need to be linked, not least for optimal clinical research. Accessing and integrating these phenotypic and molecular data from the thousands of local biobanks, project-specific databases, and broad public repositories in which they are held is hindered, not least, by differences in how the information is structured. Often, this entails reading free-text descriptions where no structured phenotypic representation exists. For example, complex phenotypic data from genome-wide association studies (GWAS) in resources such as the GWAS Central database [Thorisson et al., 2009] and the National Human Genome Research Institute catalogue [Hindorff et al., 2009] are initially collected by labor-intensive manual extraction from the literature by database curators, as the phenotypes are not published at source in a standard, exchangeable format. Similarly, the Biobanking and Molecular Resource Infrastructure (BBMRI) biobanking project consolidates 280 specialist European biobanks and other organizations that preserve millions of samples and associated annotations in diverse local schemas or formats [Yuille et al., 2008]. Other projects populate DNA variants and phenotypes into thousands of locus-specific databases (LSDBs) [Fokkema et al., 2011], and organize high-resolution phenotype findings in numerous model organism databases such as WormBase [Harris et al., 2010] and the Mouse Phenome Database [Grubb et al., 2009].
It is essential that integrated querying across such sets of phenotype information be made possible, and this will require the widespread use of powerful, standardized, user-friendly solutions regarding data syntax.
To this end, several EU and national organizations (EU-GEN2PHEN, EU-PANACEA, EU-BioSHaRE, BBMRI-NL, BBMRI.FI, EBI, P3G, NBIC, and LifeLines) joined forces to develop a lightweight informatics system for storing and sharing observation data. We present this group's progress by first summarizing “Observ-OM”: a minimal model that defines and interrelates the conceptual subcomponents that are intrinsic to any and all kinds of “observation” (which within the realm of biomedicine would be called “phenotypes”). We then describe “Observ-TAB”: a simple yet powerful spreadsheet format based upon Observ-OM that can be easily adopted by individual researchers, larger consortia, and biobanks without extensive software engineering. Finally, we describe a range of real-world software applications that demonstrate the practical utility of these two solutions.
Observ-OM: A General “Phenotype” Observation Model
Despite the vast complexity and diversity of possible biomedical observations, all such information can, in principle, be described via a straightforward set of four underlying concepts. These are specified by Observ-OM, an “object model” to describe the principal data elements and their relationships, which was drafted as a general solution in a series of interactive workshops organized by the GEN2PHEN project, involving experts from model organism and human research communities and employing diverse epidemiological, phenotypic, genetic, and molecular profiling strategies. In parallel, existing object models and tabular formats (TAB), such as MAGE-TAB [Rayner et al., 2006], FuGE-OM [Jones et al., 2007], eXtensible Genotype And Phenotype (XGAP)-OM [Swertz et al., 2010], PaGE-OM [Brookes et al., 2009], OBiBa (http://www.obiba.org), SAIL [Gostev et al., 2011], dbGaP [Mailman et al., 2007], and GenomeEUtwin [Litton et al., 2003] were carefully evaluated and used for guidance on validated successful design patterns and to ensure technical interoperability (see Table 1).
|API||Application programming interface||–|
|BBMRI||EU ESFRI project integrating biobank data across countries||www.bbmri.eu|
|BBMRI.FI||Finnish BBMRI hub||www.bbmri.fi/en|
|BBMRI-NL||The Netherlands BBMRI hub||www.bbmri.nl|
|BioSHaRE||EU biobank data integration project||www.bioshare.eu|
|DataSHaPER||Biobank harmonization platform of Public Population Project in Genomics (P3G)||www.datashaper.org|
|GEN2PHEN||EU project integrating genotype and phenotype data||www.gen2phen.org|
|GUI||Graphical user interface||–|
|GWAS Central||A centralized compilation of summary level findings from genetic association studies||www.gwascentral.org/|
|HL7||Health Level Seven||www.hl7.org|
|MAGE-TAB||A tab-delimited format for representing functional genomics data||www.mged.org/mage-tab|
|MOLGENIS||Software generating infrastructure (databases, APIs, GUIs) for life science projects||www.molgenis.org|
|OBiBa||A suite of open source software for biobanks from P3G||www.obiba.org|
|Object Model||An abstract representation of a domain's concepts, data, and the relationships between them, used to design or generate software||–|
|P3G||Public Population Project in Genomics (P3G), an international consortium for population genomics||www.p3g.org|
|PaGE-OM||An object model for genotype and phenotype data||www.pageom.org|
|PANACEA||Quantitative pathway analysis of natural variation in complex disease signalling in C. elegans||–|
|Observ-TAB||A tab delimited format derived from Observ-OM used to represent and exchange phenotype data||www.observ-om.org|
|XGAP||XGAP is an open and flexible object model for xQTL, GWL, GWA and mutagenesis data||www.xgap.org|
A universal way to conceptualize phenotypic and other observations came out of this work, entailing just four core concepts: an observation Target, an observable Feature, an observation Protocol (and its Application), and the observed Value (see Fig. 1). The observation Target is the entity to which the phenotype information relates. It can accommodate any kind of object or concept, such as a panel of case/control samples as in the Framingham study in dbGaP [Cupples et al., 2007], a whole mouse strain as in Europhenome [Morgan et al., 2010], or an individual specimen as in MPD [Grubb et al., 2009]. The observable Feature is the character or “question” under consideration (see below for further details on how this question is structured). The observation Protocol describes the method or procedure used to assess the Feature, whether this is a clinical, laboratory, computational, inferential, or statistical process, along with its actual instance of use in Protocol Application. Finally, the observed Value defines the observation itself. For example, an observed Value of “140.1” could be reported for a particular Target = “Total consent group” for Feature = “BLOOD PRESSURE, SYSTOLIC, FIRST READING TAKEN BY PHYSICIAN” using a particular examination Protocol = “FRAMINGHAM Clinic Exam, Original Cohort, Exam 10.” Figure 1 gives further examples of the use of this top-level phenotype model in practice.
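The four core concepts can be made concrete with a minimal sketch. The Python class and attribute names below (including the `label` of the Protocol Application) are illustrative assumptions, not the published Observ-OM attribute list; only the example strings come from the Framingham example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    name: str


@dataclass(frozen=True)
class Feature:
    name: str


@dataclass(frozen=True)
class Protocol:
    name: str


@dataclass(frozen=True)
class ProtocolApplication:
    protocol: Protocol
    # Hypothetical label identifying one concrete use of the protocol.
    label: str


@dataclass(frozen=True)
class ObservedValue:
    target: Target
    feature: Feature
    protocol_application: ProtocolApplication
    value: str


exam = Protocol("FRAMINGHAM Clinic Exam, Original Cohort, Exam 10")
observation = ObservedValue(
    target=Target("Total consent group"),
    feature=Feature("BLOOD PRESSURE, SYSTOLIC, FIRST READING TAKEN BY PHYSICIAN"),
    protocol_application=ProtocolApplication(exam, label="exam-10-application"),
    value="140.1",
)
print(observation.target.name, "->", observation.value)
```

Keeping the Value as a string here is a simplification; a fuller model would interpret it according to the data type declared for the Feature.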
A more detailed description of Observ-OM is provided in Supp. Materials and at the project Website (http://www.observ-om.org). Figure 2 elaborates the main elements in the model, specifies the key relationship roles between them by naming these connections, and highlights the importance of suitable ontologies in naming model elements, their attributes, and the data themselves. It also shows circular reflexive relations on all main elements to indicate that each such concept can be related to itself. For example, the Target part of a phenotype record could comprise a group of several different Targets, such as samples of four limbs from a mouse individual being evaluated for average degree of stunted growth.
Because the core elements are fairly abstract, additional flavors of Feature and Target can be added via convenient “subclasses” (marked by a triangle in Fig. 2, following object-oriented modeling conventions). Subclasses are identical to the core elements but with additional attributes, such as Panel (is a Target) having attribute “individuals” to group individuals as a cohort, Sample (is a Target) having attributes “source” and “tissue,” and Individual (is a Target) having attributes for “father” and “mother” relationships. Note that subclasses are for convenience only; their information can also be reported using core elements, ensuring compatibility with the core model, for example, by using a Protocol that observes a Feature named “father” and a Feature named “mother” instead. Subclasses can be used in place of their “superclass,” that is, Individual and Panel can be the Target of an observed Value.
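A short sketch may help show why subclasses are only a convenience. It assumes simple Python classes for Target, Individual, and Panel, and a hypothetical helper `as_core_observations` that re-expresses the “father” and “mother” attributes as plain (Target, Feature, Value) triples; the individual and panel names are invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Target:
    name: str


@dataclass
class Individual(Target):
    father: Optional[str] = None
    mother: Optional[str] = None


@dataclass
class Panel(Target):
    individuals: List[str] = field(default_factory=list)


def as_core_observations(individual: Individual) -> List[Tuple[str, str, str]]:
    """Report subclass attributes as core (target, feature, value) triples."""
    triples = []
    for feature in ("father", "mother"):
        value = getattr(individual, feature)
        if value is not None:
            triples.append((individual.name, feature, value))
    return triples


child = Individual(name="ind3", father="ind1", mother="ind2")
family = Panel(name="family1", individuals=["ind1", "ind2", "ind3"])

# A subclass instance can stand in wherever a Target is expected.
print(isinstance(child, Target), isinstance(family, Target))
print(as_core_observations(child))
```

Either representation carries the same information, which is what keeps subclass-based applications compatible with the core model.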
When using Observ-OM, it is particularly important to have a good understanding of how the Feature class is intended to operate. Specifically, Features are essentially questions about phenotypic or other attributes or traits that can be observed or derived by protocol, and these are split into Characteristics for situations where the objective is simply to record presence or absence (i.e., possible Values are yes/no/unknown) and Measurements if multiple answers (Values) are allowed that may or may not be constrained to a limited set or range of options. In case of a limited set, discrete possible Measurement values are termed a Permitted Value, represented by Characteristic. To illustrate this, Figure 2 gives the example of a specific Characteristic subclass called Variant, that is, a known DNA variant for a Marker at a certain bp position on a particular Chromosome, wherein Reference and Alternative alleles are known and for which the Feature question is therefore “sequence variant observed at this genome location” yes/no/unknown. Other examples we have validated in the contexts of quantitative trait loci (QTL), GWAS, LSDB, and next-generation sequencing (NGS) datasets are listed below as part of the description of various real-world applications.
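The distinction between Characteristic and Measurement can be sketched as a small validation rule. This is a sketch under stated assumptions: the `accepts` method is hypothetical, numeric range constraints are left out, and the “severity” Measurement with its permitted values is an invented example.

```python
from dataclasses import dataclass, field
from typing import List

# A Characteristic only records presence or absence.
CHARACTERISTIC_VALUES = ("yes", "no", "unknown")


@dataclass
class Feature:
    name: str


@dataclass
class Characteristic(Feature):
    def accepts(self, value: str) -> bool:
        return value in CHARACTERISTIC_VALUES


@dataclass
class Measurement(Feature):
    # Permitted Values are themselves represented by Characteristics.
    permitted_values: List[Characteristic] = field(default_factory=list)

    def accepts(self, value: str) -> bool:
        if not self.permitted_values:
            return True  # unconstrained Measurement: any Value is allowed
        return value in {pv.name for pv in self.permitted_values}


variant = Characteristic("sequence variant observed at this genome location")
severity = Measurement(
    "severity", [Characteristic("mild"), Characteristic("severe")]
)
pressure = Measurement("Rpressure")

print(variant.accepts("yes"), variant.accepts("heterozygous"))
print(severity.accepts("mild"), severity.accepts("moderate"))
print(pressure.accepts("-108.0"))
```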
Observ-TAB: A Simple Spreadsheet for Observation Data
To provide a direct practical means for sharing observation data without the need for complex informatics support, a tabular exchange format was developed (Observ-TAB) using the model concepts described above. Many researchers already use Excel spreadsheets to share “data matrices” with the results of their observation protocols, and Observ-TAB provides an exact fit into this way of working. As illustrated in Figure 3d, in the Observ-TAB format, each data row represents a Target (e.g., the “individual” subjects) and each data column represents a Feature (e.g., “[what is] individual's right ear pressure in daPa”), and each cell is a Value (e.g., “−108.0 daPa”). A particular research investigation, in this case, could comprise many Excel sheets (datasets) to record, for example, genotypes for many markers for a series of cohort members on one sheet, their phenotypes on another sheet, their individual case reports on yet another sheet, and resulting GWAS association P values for multiple markers against different subphenotypes on a final sheet (each datasheet relating to the protocol that was used).
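As an illustration of the row, column, and cell convention, the following sketch reads a tab-delimited data matrix into (Target, Feature, Value) triples. The sheet content is invented apart from the “Rpressure” example; the `Lpressure` column, the individual names, and the `read_matrix` helper are assumptions made for the example, and empty cells are treated as “not observed.”

```python
import csv
import io

# One data sheet: first column names the Target, other columns are Features.
SHEET = (
    "individual\tRpressure\tLpressure\n"
    "ind1\t-108.0\t-95.0\n"
    "ind2\t-12.0\t\n"
)


def read_matrix(text):
    """Turn a data matrix into a list of (target, feature, value) triples."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    features = header[1:]
    triples = []
    for row in reader:
        target, cells = row[0], row[1:]
        for feature, cell in zip(features, cells):
            if cell != "":  # skip empty cells: nothing was observed
                triples.append((target, feature, cell))
    return triples


for triple in read_matrix(SHEET):
    print(triple)
```

The same function applies unchanged to a genotype sheet or a sheet of association P values, because only the header row decides which Features are present.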
Observ-TAB also readily enables the sharing of metadata on both Targets and Features, and this is a common use case in genetics. It entails simply adding a new Excel sheet every time one wishes to annotate the properties of any of the data (sub)classes in Observ-OM. For example, in Figure 3b and c, the sheet “Individual” lists all the individuals used as Target elements, and sheet “Measurement” lists all the phenotype assessments used as Feature elements (Measurement is a subclass of Feature, used in many of our real-world applications). In this format, the first column defines the unique name of each Target or Feature, and additional columns describe its properties. For example, the first Measurement entry has name = “Rpressure” in the first column, followed by properties description = “middle ear pressure at maximum compliance, right ear,” unit name = “daPa,” and dataType = “string.”
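A metadata sheet of this kind can be read into a lookup keyed on the first column, which then makes it possible to check that every column of a data sheet has been declared. In this sketch the header spellings (`unit_name` in particular), the helper names, and the `Lpressure` column are assumptions; the “Rpressure” row repeats the example given above.

```python
import csv
import io

MEASUREMENT_SHEET = (
    "name\tdescription\tunit_name\tdataType\n"
    "Rpressure\tmiddle ear pressure at maximum compliance, right ear\tdaPa\tstring\n"
)


def read_annotations(text):
    """Map each unique name (first column) to its remaining properties."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    key = reader.fieldnames[0]
    return {
        row[key]: {k: v for k, v in row.items() if k != key}
        for row in reader
    }


def undeclared_columns(data_header, annotations):
    """List data-sheet columns that have no entry in the metadata sheet."""
    return [column for column in data_header[1:] if column not in annotations]


measurements = read_annotations(MEASUREMENT_SHEET)
print(measurements["Rpressure"]["unit_name"])
print(undeclared_columns(["individual", "Rpressure", "Lpressure"], measurements))
```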
The use of Observ-TAB for metadata recording and sharing also extends to Protocols and Protocol Applications. This is important because precise reporting of which Protocol was used, and exactly how it was used, is critical for proper interpretation of complex phenotypic measurements, which depend on specific equipment or assay parameters that may vary between studies or even batches of observations. To capture the necessary information, the Protocol sheet should include columns to specify how the procedure is applied plus one column to list the Features (Measurements) that will be generated by the Protocol. Naturally, there could be additional Sub-Protocol sheets to give method details for the testing of each particular Feature. As described before, all Protocol Applications are reported in sheets having the same name as the protocol (Fig. 3d). Optionally, additional columns can be added to record meta information, such as date and time relevant to the particular occasion that the Protocol was applied to each test individual (Target). Alternatively, all Protocol Applications and Observed Values can be reported in separate sheets instead (Fig. 3e). Via the unique link from Value to ProtocolApplication, each Protocol Application acts to functionally group primary observations (Values) observed by each single application of a Protocol. This system works equally well for manual assessments (e.g., questionnaires), wet laboratory activities (e.g., sample library preparations), and computational protocols (e.g., “statistical association analysis”).
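The grouping role of the Protocol Application in the separate-sheet layout can be sketched as follows. The protocol name “tympanometry,” the application identifiers, the times, and the values are all hypothetical; the point is only that each Value row carries a reference to exactly one Protocol Application.

```python
from collections import defaultdict

# Hypothetical Protocol Application sheet: one entry per use of a Protocol.
APPLICATIONS = {
    "tympanometry_app1": {"protocol": "tympanometry", "time": "2011-03-01T09:30"},
    "tympanometry_app2": {"protocol": "tympanometry", "time": "2011-03-02T14:00"},
}

# Hypothetical Observed Value sheet: (application, target, feature, value).
VALUES = [
    ("tympanometry_app1", "ind1", "Rpressure", "-108.0"),
    ("tympanometry_app1", "ind1", "Lpressure", "-95.0"),
    ("tympanometry_app2", "ind2", "Rpressure", "-12.0"),
]


def group_by_application(values):
    """Collect the Values produced by each single Protocol Application."""
    grouped = defaultdict(list)
    for application, target, feature, value in values:
        grouped[application].append((target, feature, value))
    return dict(grouped)


for application, observed in group_by_application(VALUES).items():
    meta = APPLICATIONS[application]
    print(application, meta["protocol"], meta["time"], observed)
```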
Where existing ontologies are available, our Observ-OM/TAB system promotes their use by allowing specific references to ontology terms. For example, elevated systolic blood pressure in mammals can be annotated with the MP:0006144 term defined in Mammalian Phenotype Ontology [Smith et al., 2005] as “abnormal increase in the pressure in the arteries as the heart contracts and pumps blood into the arteries.” Use of ontologies enhances searching by traversal of subtypes of disease, and also by use of the synonyms they contain. Use of ontology references promotes integration by ensuring that potentially ambiguous terms are linked to formal definitions. Valid Observ-TAB files are encoded as Excel or as a directory of tab-delimited/comma-separated values files (where each file is representing one data sheet).
Evaluation in Real-World Applications
To date, we have tested the Observ-OM and Observ-TAB system in more than 20 applications ranging from small-scale laboratory observations to high-throughput NGS and GWAS/GWL data from multiple species. In this section, we summarize some of these “apps” to highlight how the model and the format were used, and what data display or processing software was produced. All supporting software was built on the MOLGENIS biosoftware platform, which eases software sharing and reuse [Swertz and Jansen, 2007; Swertz et al., 2010]. A key feature of MOLGENIS is that it can read an object model and then autogenerate a matching database, complete with user interfaces and software interfaces to load Excel and CSV files, R statistics, and Web services. It is thus ideal for our needs, as we wanted to rapidly create subclasses of Features and Targets to easily apply the Observ-OM in each particular domain of use. In MOLGENIS, the core and subclass models are declared in human-readable XML files. Full documentation on the MOLGENIS model and generator is available at www.molgenis.org, and downloads of the model XML and software currently developed with Observ-OM are available at www.molgenis.org/svn/molgenis_apps/trunk.
App 1: Integration of Public Phenotype Observations from Mouse and Man
This use case was to collect publicly available observation data from human and mouse in dbGaP (www.ncbi.nlm.nih.gov/gap) [Mailman et al., 2007], Europhenome (www.europhenome.org) [Mallon et al., 2008; Morgan et al., 2010], and the Mouse Phenome Database (www.jax.org/phenome) [Grubb et al., 2009] and was used as a “reference implementation” to extensively test Observ-OM within the GEN2PHEN project. For each source, it was straightforward to create a data converter from the source to Observ-TAB format, using the MOLGENIS “CsvReader/Writer” toolbox. For example, in the case of the Framingham study, data are provided in dbGaP–XML format, which is difficult for humans to read. A simple converter from dbGaP–XML to Observ-TAB Excel greatly improved tractability and readability of the data in the loading phase. This first App used the core model (see Fig. 2b): Measurement to define Features having attributes name, description, unit, and dataType; Permitted Value to define categorical value options that are attached to a Measurement; Individual to use as Targets for individual-level observations; and Panel to describe a set of Individuals and to use as Targets for human-cohort/mouse-strain level observations. Example data can be queried at http://www.ebi.ac.uk/microarray-srv/pheno.
App 2: “xQTL” Research Portal for High-Throughput Observations
The developers of the XGAP system incorporated Observ-OM in their “xQTL workbench” [Arends et al., 2012]: a scalable Web system for the mapping of QTLs in mouse, worm, and plant model populations. Using Observ-OM, they can import/export, query, and analyze observation datasets and QTL profiles on gene expression (eQTL), protein abundance (pQTL), metabolite abundance (mQTL), and phenotype (phQTL) levels. The system is used by the EU-PANACEA project for multiomics in Caenorhabditis elegans and by LifeLines for human biobank phenotypes and genotypes (65,000+ samples), in both cases to provide their researchers with an integrated research portal for all their epidemiological, GWAS, and QTL analyses (e.g., replacing R/qtl with PLINK but using the same core model between them).
The xQTL App added approximately 50 convenient subclasses to Observ-OM such as Marker, SNP, MicroarryProbe, MassPeak, Metabolite, NMR, Clone, Factor, Gene, Tissue, Analysis, and DataSet, to name a few. Also, software modules were developed for Observ-OM data, including an Excel upload wizard, a matrix data browser, and an interface to the R language. To improve scalability when storing millions of gene expression, genotype, and QTL observations, xQTL implemented a binary format for data matrices as well as “drivers” to directly load the values from existing genome variation file formats such as PLINK and VCF. Example data sets can be queried at http://www.xqtl.org.
App 3: “DEB-Central” for Dystrophic Epidermolysis Bullosa Phenotypes and Genotypes
Dystrophic Epidermolysis Bullosa (DEB) is an inherited blistering disorder caused by alterations in the COL7A1 gene and covers a group of several distinctive phenotypes. Although general genotype–phenotype correlation rules have emerged, many exceptions to these rules exist, hindering diagnosis and genetic counseling. Therefore, an international DEB registry was set up to elucidate all these cases by collecting a wide variety of DEB genotype–phenotype observations [Van Den Akker et al., 2011].
The DEB-Central App uses Observ-OM to enable researchers to enter new phenotypes by simply adding new Protocols (e.g., “Cutaneous”) for particular Measurements (e.g., hands, feet, arms, proximal body flexures) without the need for any programming. Before we established Observ-OM, all phenotypes were stored using traditional database tables having one column per phenotype. This was a major drawback in that the addition of a new phenotype required a programmer to add a new column to these tables, which was overcome using Observ-OM. We also added convenience subclasses to define genetic variants: MutationGene to store the locus reference genomic [Dalgleish et al., 2010] nucleotide sequence of the gene (LRG_286 at http://www.lrg-sequence.org); Exon to define the intron and exon boundaries; ProteinDomain to define domains on the gene; and Mutation to define cDNA, gDNA, and AA sequence alterations in HGVS notation. Example data can be queried at http://www.deb-central.org.
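The difference between the two storage designs can be sketched with an in-memory SQLite database. The table and column names, the patient identifier, and the yes/no values are hypothetical and are not the actual DEB-Central or MOLGENIS schema; the sketch only shows that, when each observation is a row, a new phenotype needs no change to the table definition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One row per observed Value instead of one column per phenotype.
conn.execute(
    "CREATE TABLE observed_value ("
    "target TEXT, protocol TEXT, feature TEXT, value TEXT)"
)
conn.executemany(
    "INSERT INTO observed_value VALUES (?, ?, ?, ?)",
    [
        ("patient1", "Cutaneous", "hands", "yes"),
        ("patient1", "Cutaneous", "feet", "no"),
    ],
)

# A newly introduced Measurement is just a new feature name in a new row;
# no ALTER TABLE (and hence no programmer) is needed.
conn.execute(
    "INSERT INTO observed_value VALUES (?, ?, ?, ?)",
    ("patient1", "Cutaneous", "arms", "yes"),
)

features = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT feature FROM observed_value ORDER BY feature"
    )
]
print(features)
```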
App 4: “AnimalDB” to Track and Trace Animal Life Events in Research Laboratories
The Life Sciences Department of Groningen University developed AnimalDB to keep track of animals, locations, breeding, and experiments in and across departments. At the end of the year, the system needs to automatically generate annual reports on animal use, as required by the government. Animals are stored as Individuals, and Panel is used whenever animals need to be grouped, mostly in the context of breeding (e.g., parent groups and litters). All the animal properties (such as sex, species, genotype) are stored as observed Values. The same holds true for all experimental measurements performed upon the animals. The timestamp attribute of Value is used for experimental measurements that can be assessed multiple times. On each such occasion, a new unique Value is created with a distinct timestamp, belonging to a different Protocol Application that describes the observation event. Example data can be queried at http://www.animaldb.org.
App 5: “SequenceFlow” for Laboratory Work and Analyses in NGS
The “Genome of the Netherlands” (GoNL) is an ambitious NGS project that aims to sequence the genomes of 770 Dutch individuals within BBMRI-NL (http://www.nlgenome.nl). The SequenceFlow App was developed to track hundreds of Illumina sequence lanes (Targets) and dozens of analysis Protocols, and to schedule and run thousands of analysis steps taking many hundreds of thousands of computer hours. The SequenceFlow App added convenient subclasses to track NGS-specific Targets: NgsSample describing DNA material; Flowcell for the Illumina chips; and LibraryLane to define libraries, their application to flowcell lanes, and optionally tracking of barcodes. Furthermore, subclasses of Protocol were added to define composite protocols: Workflow to define sequences of steps; and WorkflowElement to define for each step what Protocol should be used. Finally, elements to capture computational workflows were added, such as ComputeProtocol to define the analysis scripts for sequence alignment and SNP calling, which consist of 24 steps using widely accepted NGS tools [DePristo et al., 2011]. All parameters involved (such as “reference genome”) as well as QC measurements produced (such as “number of reads aligned”) can be reported as observed Values. Many software components were added to work with this model, most notably an automated execution system to run all analysis jobs on a PBS computer cluster. Example data can be queried at http://www.molgenis.org/wiki/SequenceFlow.
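The relation between Workflow, WorkflowElement, and Protocol can be sketched as below. This is an illustration only: the `previous_steps` attribute, the `execution_order` method, and the two-step example are assumptions, and the real SequenceFlow pipeline has 24 steps that are not reproduced here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Protocol:
    name: str


@dataclass
class WorkflowElement:
    name: str
    protocol: Protocol
    # Hypothetical attribute: names of steps that must finish first.
    previous_steps: List[str] = field(default_factory=list)


@dataclass
class Workflow(Protocol):
    elements: List[WorkflowElement] = field(default_factory=list)

    def execution_order(self) -> List[WorkflowElement]:
        """Order steps so that each runs after the steps it depends on."""
        done, order = set(), []
        pending = list(self.elements)
        while pending:
            ready = [e for e in pending if set(e.previous_steps) <= done]
            if not ready:
                raise ValueError("cyclic or unresolved step dependencies")
            for element in ready:
                order.append(element)
                done.add(element.name)
            pending = [e for e in pending if e.name not in done]
        return order


pipeline = Workflow(
    name="alignment and SNP calling",
    elements=[
        WorkflowElement("call_snps", Protocol("SNP calling"), ["align"]),
        WorkflowElement("align", Protocol("sequence alignment")),
    ],
)
print([element.name for element in pipeline.execution_order()])
```

Because a Workflow is itself a Protocol in this sketch, applying it can be recorded with the same Protocol Application and observed Value machinery as any single-step protocol.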