Keywords:

  • semantic annotation;
  • data curation;
  • ontology

SUMMARY

  1. Top of page
  2. SUMMARY
  3. 1 INTRODUCTION
  4. 2 RIGHTFIELD: ONTOLOGY-ENRICHED DATA COLLECTION BY STEALTH
  5. 3 POPULOUS: ONTOLOGY DEVELOPMENT BY STEALTH
  6. 4 DISCUSSION
  7. ACKNOWLEDGEMENTS
  8. REFERENCES

The increase in volume and complexity of biological data has led to increased requirements to reuse that data. Consistent and accurate metadata is essential for this task, creating new challenges in semantic data annotation and in the construction of terminologies and ontologies used for annotation. The BioSharing community are developing standards and terminologies for annotation, which have been adopted across bioinformatics, but the real challenge is to make these standards accessible to laboratory scientists. Widespread adoption requires the provision of tools to assist scientists whilst reducing the complexities of working with semantics. This paper describes unobtrusive ‘stealthy’ methods for collecting standards-compliant, semantically annotated data and for contributing to ontologies used for those annotations. Spreadsheets are ubiquitous in laboratory data management. Our spreadsheet-based RightField tool enables scientists to structure information and select ontology terms for annotation within spreadsheets, producing high-quality, consistent data without changing common working practices. Furthermore, our Populous spreadsheet tool proves effective for gathering domain knowledge in the form of Web Ontology Language (OWL) ontologies. Such a corpus of structured and semantically enriched knowledge can be extracted in Resource Description Framework (RDF), providing further means for searching across the content and contributing to Open Linked Data (http://linkeddata.org/). Copyright © 2012 John Wiley & Sons, Ltd.

1 INTRODUCTION

Howe et al. in [1] make the compelling case that although databases have become important avenues for publishing biological data, the curation increasingly lags behind data generation in funding, development and recognition. The interpretation and reuse of data rely on it being annotated with rich, standardised metadata. However, the labour cost of metadata construction, data curation and annotation is high: it is a time-consuming and awkward process, which is undervalued and provides little reward for the skilled data curators that undertake it [2, 3].

In an effort to drive up the quality of curation at the point of data capture, various biological science communities have proposed minimum information models and community ontologies that should be used for structuring and describing these metadata. The models set community-wide expectations for the least information required to understand and interpret experimentally gathered data. Recommendations for the use of specific terminologies and ontologies clarify the information required and the choice of terminologies for annotation, and encourage widespread compliance [4].

Minimum Information about a Biological or Biomedical Investigation (MIBBI) [5] is an umbrella organisation describing minimum information models for over 30 different biological fields or experimental technologies. The approach is a pragmatic one, with many of the models little more than checklists. Many MIBBI checklists recommend the use of specific terminologies and ontologies for annotation. These ontologies are published in the BioPortal [6], a community repository for biological ontologies. Table 1 shows some minimum information models and corresponding ontologies commonly used in the life sciences.

Table 1. Examples of standards and ontologies for data annotation. Other ontologies are more generally applicable and are used across the domain. For example, Ontology for Biomedical Investigations (OBI) [7] can be used to describe experimental properties, and the Gene Ontology [8] can be used to describe functional properties of genes and gene products.
Data | MIBBI Model | Ontologies
Microarray | MIAME: Minimum Information about a Microarray Experiment | MO: MGED Ontology
Proteomics | MIAPE: Minimum Information about a Proteomics Experiment | MS: mass spectrometry; MOD: Protein Modification; SEP: sample processing and separation
Interaction experiments | MIMIX: Minimum Information about a Molecular Interaction Experiment | MI: protein–protein interaction
Systems Biology Models | MIRIAM: Minimum Information Required In the Annotation of biochemical Models | SBO: Systems Biology Ontology
Systems Biology Model Simulation | MIASE: Minimum Information About a Simulation Experiment | KISAO: Kinetic Simulation Algorithm Ontology

Data management specialists routinely use these MIBBI checklists and related terminologies to capture and standardise the structure of data for sharing or for deposition into public repositories. However, the uptake of such standards by experimental biologists is much less common. One significant reason for this difference is the ease with which the metadata standards can be applied to data. The existence of tools that assist with the process can increase compliance significantly. For example, Minimum Information about a Microarray Experiment (MIAME) was initially implemented as the MicroArray Gene Expression Markup Language (MAGE-ML), an Extensible Markup Language (XML)-based schema. Users were required to submit data in this format to ArrayExpress (http://www.ebi.ac.uk/arrayexpress/) or GEO (Gene Expression Omnibus, http://www.ncbi.nlm.nih.gov/geo/) in order to publish work relating to it. In practice, this was only feasible for laboratories with dedicated bioinformatics support. To lower the barrier to adoption of the standard, the community developed MicroArray Gene Expression Tab (MAGE-TAB) [9], a tabular alternative that allows scientists to prepare MIAME-compliant data from within Microsoft Excel spreadsheets. This more straightforward process requires no prior knowledge of MAGE or XML and enables experimentalists to record their data with tools that they already use and in which they are highly proficient. MAGE-TAB demonstrates the importance of providing tools and interfaces as well as metadata standards and schemas [10].

There is a similar situation in the development of the terminologies or ontologies that supply the metadata for these annotation tasks. For a good introduction to ontologies in biology, see [11]; here, we briefly set the context for the paper. Ontologies are used to define the entities and the relationships between those entities [12]. The labels used for the concepts describing the domain form a controlled vocabulary for that domain. Ontologies are now widely used in this way within biology; they provide metadata for describing the data. Ontologies come in a variety of forms, from simple trees or polyhierarchies, through to highly axiomatised ontologies amenable to automated reasoning.

In biology, two languages for authoring ontologies are widely used: Web Ontology Language (OWL) and Open Biomedical Ontology (OBO). OWL (http://www.w3.org/2007/OWL) is widely used for authoring ontologies as a collection of axioms, each of which has precise semantics that make such ontologies amenable to automated reasoners. OWL and its ontologies are hard to understand and use; the tools for developing OWL ontologies, such as Protégé (http://protege.stanford.edu), are best used in the hands of specialists. The OBO format (http://www.geneontology.org/GO.format.obo-1_2.shtml) is a common format for biological ontologies. OBO is less complex than OWL and may therefore be easier to understand, but the large number of OBO ontologies available, and the large numbers of terms within them, make the task of exploration difficult. Although ontologies play an important role in many fields, including biology, they are best used through tools that present them in a form compatible with the everyday working practices of users. For example, ontologies, irrespective of their level of axiomatisation, are usually used by biologists as taxonomies of vocabulary terms, and they are best presented as such.

Our aim is to make the collection, validation and construction of accurate and precise metadata a simple task: without using special tools, without the need for our laboratory experimentalist to be exposed to rich and complex ontologies and with the tools and techniques they are already familiar with. As the most common data collection and exchange file is the Microsoft Excel spreadsheet, we use standard Excel spreadsheets.

  • To enable ontology-enriched data collection. We have developed the RightField ontology annotation and information management desktop application (http://www.rightfield.org.uk) [13]. RightField adds constrained ontology term selection to Excel spreadsheets, addressing issues such as multiple ontology loading, display and browsing, multiple ontology formalisms, ontology evolution and ontology provenance, off-line working, cross-platform support, Excel application legacy, ontology selection depths and nesting, and format import–export.
  • RightField and the spreadsheets it produces are already in use in a large Systems Biology programme incorporating over 300 scientists from 120 institutes. It has significantly improved the ability of laboratory experimentalists to self-curate their data from a familiar environment and has lowered the barrier of entry to semantic annotation, helping to ensure the quality of data annotation and supporting standards compliance. When data is required to comply with particular standards in order to be published, scientists already have it preformatted. Although developed for Systems Biology, the software is generic. For example, the oil and gas industry's instrument data sheets are spreadsheet-based and are in need of a tool such as RightField, with the added capability of mapping terms from a local dialect to core terms.
  • To enable ontology content collection. We have developed the Populous ontology population desktop application (http://www.populous.org.uk) [14]. Populous extends RightField by adding support for the creation and population of ontology design patterns. Specifying good ontology design patterns in a language such as OWL can be difficult and requires knowledge of OWL syntax, OWL semantics, knowledge representation and the ontology authoring tools. The task of populating the design pattern, however, requires only an understanding of the pattern and sufficient domain knowledge to populate it. The Populous extension to RightField provides support for domain experts in populating design pattern templates using a familiar spreadsheet-style interface. Once the template is populated, Populous uses the Ontology Pre-Processing Language (OPPL) [15, 16] to map the data in the columns to variables within the design pattern to generate axioms for the growing ontology. Populous has been successfully used in the e-LICO (http://www.e-lico.eu) project for the construction of an ontology describing the Kidney and Urinary Pathway (KUP) [17]. The KUP ontology (KUPO) is used to annotate experimental data held in the KUP Knowledge-base (KUPKB). The challenge was to engage the biologists in the ontology's development, whilst shielding them from details of the ontology itself. Experienced ontology developers defined the set of design patterns for KUPO and generated a set of templates for the biologists using RightField. The content for KUPO was populated entirely by biologists with little or no previous knowledge of ontology development. The community can now submit new experiments to the KUPKB using RightField-generated spreadsheets that contain metadata coming from KUPO.
  • To enable information model design. We combine RightField and Populous. The overall structure for the KUPO was developed by bio-ontologists, and Populous was used to populate the patterns implied by the KUPO. Portions of the KUPO can then be used within a RightField spreadsheet to help biologists provide content to the KUPKB. As the biologists use the KUP-orientated RightField data submission form, gaps in the KUPO will be spotted; Populous can then be revisited to update the KUPO. The KUPO provides an information model for placing into RightField; as the RightField templates are used, the information model can change. The Populous extension to RightField then allows the same biologists to expand the ontology that forms the information model for the RightField-generated spreadsheets.
  • RightField and Populous are so-called Data Ramps [18], tools that lower the barriers to using and manipulating data. These tools are not Excel plugins; instead, they are Excel spreadsheet generators. The spreadsheets are plain Excel spreadsheets that use only the standard features of Excel to produce drop-down lists of allowed cell values. In this way, the features previously outlined are ensured. The tabulation and constraints within spreadsheets generated by RightField mean that data collected using RightField templates is highly structured and can be exported and stored in relational databases, XML or RDF. As all this is achieved using only Excel features in an environment familiar to target users, these highly structured data are produced by stealth, avoiding the need to develop and install a bespoke tool and to train users in it. RightField and Populous are important components towards quality-assured Linked Data. Linked Data is a method for exposing, sharing and connecting pieces of data, information and knowledge on the Semantic Web using unique identifiers and common formats. These tools automatically produce semantically annotated life science data that could be exported to the Linked Data cloud in a structured format such as RDF.

RightField and Populous are open source Java applications distributed under the Berkeley Software Distribution (BSD) licence and available from http://www.rightfield.org.uk and http://www.populous.org.uk, respectively.

The paper is organised as follows. In Section 2, we present the motivation for RightField and demonstrate how it is used for semantically enriched and validated data annotation against project-defined information model templates by the pan-European SysMO Systems Biology Consortium. In Section 3, we present the motivation for Populous and demonstrate how it is used to gather content and populate ontologies against pre-defined patterns by the European Union e-LICO Project. We also propose how RightField and Populous can be used to define and validate new minimum information models. In Section 4, we reflect on our experiences of the tools in the field, including shortcomings, and set out our future plans with respect to Linked Data. Related work is referred to throughout the sections.

2 RIGHTFIELD: ONTOLOGY-ENRICHED DATA COLLECTION BY STEALTH

A classic example of the challenges of accurate yet onerous scientific data collection is demonstrated by the SysMO consortium (Systems Biology of Microorganisms, http://www.sysmo.net). This consortium is a pan-European research initiative to record and describe the dynamic molecular processes occurring in microorganisms and to present these processes in the form of computerised mathematical models. The experiments performed across the 13 multi-partner projects vary widely. Its scientists typically perform a mixture of high-throughput 'omics experiments, such as microarray analysis or proteomics, as well as traditional molecular biology and enzyme reaction kinetics. Despite this diversity, the consortium aims to pool its research capacities and know-how, sharing data, methods, models and results within the consortium and with the rest of the Systems Biology community. It has developed, through a specially funded SysMO-DB project [19], a suite of resources and a web-based platform, the SEEK [20], to provide a data sharing environment for scientists within the consortium, a dissemination mechanism to share research with the wider community and a gateway to other commonly used community resources. The most common data collection and exchange format is the Microsoft Excel spreadsheet.

There is no additional curation resource outside the projects so data curation must be handled at source by the scientists. The laboratory scientists in SysMO (some 300 drawn from over 120 institutes across Europe) have little experience in metadata management and semantics but understand the value of reusing data. The time they have to spend on data curation is limited, so they will only adopt community standards if there are clear advantages in doing so. The range of MIBBI models and accompanying ontologies they are expected to comply with is wide, confusing and daunting; the metadata models are complicated and impractical, requiring familiarity with XML formats, ontology languages and formats, and the annotation of typically over 100 metadata elements. Often, multiple ontologies are needed for the annotation of a single file.

The SysMO-DB project has improved ‘curation at source’ by leveraging the widespread use of, and familiarity with, spreadsheets. Experimentalists collect data against standard Excel spreadsheet templates based on variations of the MIBBI community models. Terms from ontologies are exposed as simple dropdown cell selection lists, shielding the scientist from understanding and browsing any ontologies and guiding them into only choosing valid terms.

To enable expert bioinformaticians to make these templates, we have developed the RightField ontology annotation and information management desktop application for adding constrained ontology term selection to Excel spreadsheets. Thus, SEEK users are encouraged to upload spreadsheets that are RightField enabled.

RightField is an administrator's tool. The workflow is a three-stage process.

  1. An expert bioinformatician or data manager prepares data collection templates using standard Excel. These are imported into RightField. Ontologies are accessed and browsed through the BioPortal, a repository of biological ontologies (http://bioportal.bioontology.org/) or from their local file system (Figure 1(a)). The tool supports ontologies in OBO (http://www.geneontology.org/GO.format.obo-1_2.shtml), OWL and RDFS formats and simple RDF vocabularies. Ontology terms are selected and bound to the cells. Terms can be all the subclasses from a chosen class, all the individuals (instances), or a combination of both (Figure 1(b)). Individual cells or whole columns or rows can be marked with the required ranges of ontology terms (Figure 1(c)). A spreadsheet can be annotated with terms from multiple ontologies. The full Internationalised Resource Identifiers (IRIs) and associated labels for these ontology elements are retained and discreetly stored within hidden sheets in the spreadsheet file. This affords a full provenance track concerning the origins and versions of the ontology terms used in the annotation. The spreadsheets are saved in native Excel format.
  2. A laboratory experimentalist opens the RightField-enabled spreadsheet in standard Excel or OpenOffice (Figure 2). No plugins are required, and no macros, Visual Basic code or platform-specific libraries are needed for its use. No connection to any other service is required: the spreadsheet is self-contained as terms are embedded within its file. The spreadsheet behaves as usual; however, marked-up cells (Figure 2(a)) are restricted to selections from a simple drop-down list of terms using the term labels (Figure 2(b)). The ontologies have been bridged and flattened into validated vocabulary pick-lists, so no in-depth knowledge of the ontologies or the reporting standards is required. By selecting a term, the IRI of the ontology term is attached to the cell for that data sheet. Once marked-up and saved, the spreadsheet data can link back to the original ontology. The default behaviour is to retain this information in a hidden state to shield the user from the complexities of the underlying semantics. However, a feature to toggle between term labels and identifiers has subsequently been included at the request of the expert bioinformatician and the curious experimentalist.
  3. The marked-up datasheets now have embedded annotation bound to the IRIs of ontology terms. When viewed using Excel or OpenOffice, these appear as labels (Figure 2(b)). However, for semantically capable applications, these encapsulated ontology IRIs are available for extraction and processing alongside the data they annotate. This has the added benefit that the data is ready for publishing into the Linked Data cloud.
Figure 1. A screenshot of RightField showing a loaded spreadsheet template and the process of marking cells with ranges of ontology terms from the Microarray Gene Expression Data (MGED) ontology [21].

Figure 2. Excel spreadsheet using a RightField template.

Figure 3 describes the RightField architecture. RightField combines the use of the Apache POI library to read and manipulate spreadsheets and the OWL Application Programming Interface (API) [22] to read and process ontology files. In order to achieve seamless integration with Excel, RightField augments an existing Microsoft Excel Workbook with the data as follows. For each distinct set of ontology terms selected by the informatician, RightField adds a ‘very hidden’ worksheet to the workbook. Each worksheet records each term in the set as a tuple containing the human readable term name, its unique identifier (IRI) and the most specific ontology that ‘defines’ the term. If the associated defining ontology was obtained from the internet or from the BioPortal ontology repository, then RightField records the ontology document IRI, both for versioning purposes and so that the original ontology can automatically be reloaded if the workbook is re-opened in RightField at some point in the future. Having embedded each constraining term set as a very hidden worksheet, RightField then adds Excel Data Validation constraint information to each of the marked-up cells, or ranges of cells, in the user worksheets. The Data Validation functionality used is taken directly from Excel. Using out-of-the-box functionality such as this is one of the reasons why RightField-enabled spreadsheets can be opened in Excel without the need for any special Excel Add-Ins or platform-specific native code, both of which would decrease the robustness of the solution. In essence, the Excel Data Validation functionality links the ranges of cells in the very hidden worksheets to the human readable names for the ontology terms in the user worksheet. Given this markup ‘schema’, it is easy to map back from the values of cells that correspond to human readable term names to the IDs of the corresponding terms and the ontologies that define them.
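
The markup ‘schema’ described above can be illustrated with a minimal Python sketch. It is not RightField's implementation, which is written in Java on top of Apache POI and the OWL API; it only models the idea of a term set stored as (label, IRI, defining ontology) tuples and the mapping from a cell value back to its identifier. The names `Term` and `TermSet` and the example IRIs are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    """One row of a hidden term-set sheet: label, IRI, defining ontology."""
    label: str
    iri: str
    ontology_iri: str


class TermSet:
    """A distinct set of allowed terms, as stored on one hidden worksheet."""

    def __init__(self, terms):
        self._by_label = {t.label: t for t in terms}

    def labels(self):
        # What the drop-down list in a marked-up cell would display.
        return sorted(self._by_label)

    def is_valid(self, cell_value):
        # Mirrors the effect of list-based data validation on a cell.
        return cell_value in self._by_label

    def resolve(self, cell_value):
        # Map a human-readable cell value back to its full term record.
        return self._by_label[cell_value]


# Hypothetical example terms; the IRIs are placeholders, not real identifiers.
ONTOLOGY = "http://example.org/ontology/assay"
assay_types = TermSet([
    Term("microarray", ONTOLOGY + "#microarray", ONTOLOGY),
    Term("proteomics", ONTOLOGY + "#proteomics", ONTOLOGY),
])
```

Because the label-to-IRI mapping travels inside the workbook, a semantically capable application could call something like `assay_types.resolve("proteomics").iri` without contacting an ontology server.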

Figure 3. The RightField implementation architecture.

During the design of the markup ‘schema’, the possibility of embedding complete ontologies into Excel Workbooks was considered. The motivation for this was that the exact copy of the ontology that defines the constraining ontology terms would then be preserved for future use and extraction, thus providing a robust provenance solution. With regard to providing such a solution, it is not difficult to imagine how an ontology could be encoded into a grid that could then be written into an Excel worksheet. However, some bio-ontologies contain hundreds of thousands of axioms that define hundreds of thousands of terms. Since older, but still widely used, versions of Excel impose rather strict size limitations on the amount of data that can be stored in a worksheet, this solution was discounted at an early stage.
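
The size argument can be made concrete with a small calculation. The sketch below assumes, purely for illustration, an encoding of one axiom per worksheet row and an ontology of 300,000 axioms; neither figure comes from RightField itself. The row limits are those of the Excel 97-2003 (.xls) and the later .xlsx worksheet formats, and the function name `sheets_needed` is hypothetical.

```python
XLS_MAX_ROWS = 65_536       # row limit of an Excel 97-2003 (.xls) worksheet
XLSX_MAX_ROWS = 1_048_576   # row limit of an .xlsx worksheet


def sheets_needed(axiom_count, max_rows):
    """Worksheets required if each axiom occupies exactly one row."""
    # Ceiling division without importing math.
    return -(-axiom_count // max_rows)


# Illustrative ontology size, in line with "hundreds of thousands of axioms".
AXIOMS = 300_000
legacy_sheets = sheets_needed(AXIOMS, XLS_MAX_ROWS)
modern_sheets = sheets_needed(AXIOMS, XLSX_MAX_ROWS)
```

Under these assumptions, a single ontology would already have to be split across several worksheets in the older format, before any per-cell size limits are taken into account.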

2.1 Virtue in simplicity

The simplicity of this approach—fixing and embedding terms within hidden sheets—is a significant feature to match the operational requirements of our target users (experimentalists and informaticians).

  • Ontology stability: The choices of terms are stable once the spreadsheet is established. This ensures that a series of experiments can be annotated with the same versions of the same ontologies. If an ontology needs to be updated, the spreadsheet must be actively uploaded and modified within RightField.
  • Offline working: Linkage to the ontology server is only required at the time the template is created. This enables experimentalists to add or view data offline and applications to view or process (with limitations) the data without a connection to an ontology server.
  • Cross-platform support: Our experimentalists use Apple and Windows operating systems, often with old versions of Excel or OpenOffice, and do not want plug-ins. Thus, macros, Visual Basic code or platform-specific libraries are out of bounds.
  • Standard Tools: Several metadata collection efforts have sought to take advantage of the spreadsheet metaphor by developing bespoke ‘spreadsheet like’ ‘data disposition tools’. However, our experimentalists wanted to use the mainstream, off the shelf tools they were already familiar with.
  • Distribution: The self-contained and plain nature of the resulting datasheets means that they can be stored, distributed and shared amongst scientists as normal, through email, shared file stores, USB sticks and so on.
  • Ontology flattening: Life science ontologies come in several formats (OBO, OWL, RDFS, SKOS) and are often broad and deep trees. Some, such as the Gene Ontology, have tens of thousands of terms. It is easy to become overwhelmed, especially as one data sheet requires multiple ontologies. By picking out the relevant terms for a particular context, we have merged and flattened these complex structures. A side-effect is the exposure of shallow ontologies with many subclasses and instances. This can result in long drop-down boxes, making the annotation process more difficult and prone to inaccuracies. Auto-complete functions for marked up cells can help relieve this problem but it actually highlights a wider issue with the structure of some of the ontologies. A community standard that specifies that users should select from over 50 terms, for example, may indicate that the ontology requires further development in that area. Defining more specific subclasses and splitting any instances between them would reduce this problem much more effectively and would benefit other users of the ontology.
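
The flattening step in the last point can be sketched in a few lines of Python. This is an illustrative model, not RightField code: the hierarchy is a plain dictionary rather than a loaded ontology, the toy class names are invented, and the review threshold simply reuses the figure of 50 terms mentioned above.

```python
def flatten_subclasses(hierarchy, root):
    """Return all descendants of `root` as a flat, sorted pick-list.

    `hierarchy` maps a class label to the labels of its direct subclasses.
    A class reachable by several paths (a polyhierarchy) is listed once.
    """
    seen = set()
    stack = list(hierarchy.get(root, []))
    while stack:
        label = stack.pop()
        if label not in seen:
            seen.add(label)
            stack.extend(hierarchy.get(label, []))
    return sorted(seen)


def needs_review(pick_list, max_terms=50):
    # Flags pick-lists long enough to make drop-down selection error-prone.
    return len(pick_list) > max_terms


# Hypothetical toy hierarchy, not taken from a real ontology.
toy = {
    "assay": ["omics assay", "kinetics assay"],
    "omics assay": ["transcriptomics", "proteomics"],
    "kinetics assay": ["enzyme kinetics"],
}
```

Calling `flatten_subclasses(toy, "assay")` yields a single alphabetical list of all five descendant classes, which is the shape of data a drop-down list needs; `needs_review` marks the cases where the source ontology may benefit from more specific subclasses.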

2.2 RightField in practice

In the SysMO consortium, spreadsheets are prepared to conform to the ‘Just Enough Results Model’ (JERM). The JERM is the SysMO-DB minimum information model. It describes what type of experiment was performed, who performed it and what was measured. It defines and describes the relationships between SysMO assets (i.e. different types of data, mathematical models and experimental methods) and allows JERM representations to be mapped to MIBBI models where available.

Experiments in SysMO are related to one another using the ISA-TAB format [23]. Investigations, Studies, Assays Tab (ISA-TAB) is a tabular representation for describing how experiments (Assays) are grouped into wider Studies and Investigations. It allows multiple 'omics investigations to be linked together and described and provides a format to submit data into existing databases, such as ArrayExpress [24] and PRIDE [25]. This is a typical use for RightField. Research groups performing transcriptomics, metabolomics and proteomics experiments can use RightField-enabled templates to capture and therefore integrate the results from these multiple studies. The templates can ensure that biological entities in each data set (genes, metabolites, proteins, etc.) are named consistently and that each data set is compliant with the appropriate standard (e.g. MIAME, Minimum Information About a Cellular Assay and Minimum Information about a Proteomics Experiment, respectively).

The combination of embedding ontology term selection and standardising the content and format of spreadsheets provides a simple mechanism to ensure data is consistently annotated and compliant with community standards, ‘by stealth’. The aim is to make it easier to use appropriate ontology terms than not to do so. The result for the SysMO consortium is a growing corpus of experimental data that is easier to compare and explore. RightField-enabled templates for transcriptomics data are currently the most widely used due to the more stringent reporting standards for transcriptomics publications compared with other areas. A collection of RightField-enabled templates developed for SysMO can be found here: http://www.sysmo-db.org/rightfield/templates.

The next step for SysMO-DB is the exploitation of the emerging corpus of RightField-annotated data. By extracting and storing the data in RDF, based on specifications and mappings to the JERM ontology (http://bioportal.bioontology.org/ontologies/45133), we are able to provide further mechanisms for searching across the content of spreadsheets and publishing SysMO data as Open Linked Data [26]. Compliance with community ontologies means that the system can already use Web Services from the BioPortal for term lookup and visualisation.
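
As a rough illustration of what such an extraction could produce, the following Python sketch serialises one annotated cell as N-Triples using only string formatting. It is an assumption-laden sketch, not the SysMO-DB export: the predicate `annotatedWith`, the cell IRI scheme and all `example.org` IRIs are hypothetical, and a real export would use terms from the JERM ontology instead. Only the `rdfs:label` IRI is a real, standard identifier.

```python
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
# Hypothetical predicate standing in for a term from a real vocabulary.
ANNOTATED_WITH = "http://example.org/vocab#annotatedWith"


def escape_literal(text):
    # Escape the characters that must be escaped inside N-Triples literals.
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def annotation_triples(cell_iri, term_iri, label):
    """Return N-Triples lines linking a cell to its ontology term and label."""
    return [
        f"<{cell_iri}> <{ANNOTATED_WITH}> <{term_iri}> .",
        f'<{term_iri}> <{RDFS_LABEL}> "{escape_literal(label)}" .',
    ]


# Hypothetical cell and term identifiers.
lines = annotation_triples(
    "http://example.org/sheet1#B2",
    "http://example.org/ontology#proteomics",
    "proteomics",
)
```

Because each marked-up cell already carries the IRI of its term, the export step reduces to pairing cell identifiers with the stored IRIs; no text matching against ontology labels is needed.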

2.3 Related work

The idea of using spreadsheets as a data collection mechanism is not novel. The developers of ISA-Infrastructure tools have produced a suite of tools to design and validate ISA compliant data [27]. Ontology terms can also be added via an Ontology Lookup Service [28] and a BioPortal Plugin. The ISA Creator GUI has the look and feel of a spreadsheet but can only perform some of the same functions and is not a standard tool. Scientists must download and install the software and learn a new environment, and more of the complexities of the ontologies are exposed making it unsuitable for experimentalists. Other ontology annotation tools in the life sciences include Phenote, which assists with the annotation of biological phenotypes (http://phenote.org), and the PRIDE Proteome Harvest Spreadsheet submission tool (http://www.ebi.ac.uk/pride/proteomeharvest/), which assists with the annotation and submission of proteomics data to the PRIDE public repository. These are powerful annotation tools for specific biological disciplines, and are not generically applicable.

3 POPULOUS: ONTOLOGY DEVELOPMENT BY STEALTH

The Populous tool [29] is an extension to RightField that changes the focus from data annotation to ontology development.

Developing logic-based ontologies in a language such as OWL or OBO is a specialised task, but gathering the domain knowledge that forms the content of the ontology requires input from the domain specialist [30]. Current ontology authoring tools, such as Protégé, require an understanding of OWL semantics and ontology construction, as well as the consistent application of axiom patterns for related types of entities. These requirements suggest that a separation of domain knowledge and axiomatisation is needed in constructing OWL ontologies. The tabulation of entities and the attributes that describe them is a common practice in developing OWL ontologies [31, 32]. Much of an ontology's content can be authored as a set of axiom patterns or templates. Spreadsheets are commonly used to tabulate the information that is used to populate these patterns. For example, the Ontology for Biomedical Investigations (OBI) [33] describes many forms of assay used in an investigation; each assay is described in terms of its analyte, evaluant and unit. Quick Term Templates [32] provide a form-style table into which contributors to the ontology can add values that will form the ontology's content when combined with the relationships in the pattern associated with the template. As the details for each assay can be easily tabulated by a domain expert and the axioms for the ontology can be readily generated from this tabulated knowledge, a barrier to ontology development is overcome. This separation allows the domain experts who know about the assays to describe them in terms of other parts of OBI without writing complex OWL axioms, and to do so in a consistent style across a large number of assay types. Using a spreadsheet-based approach enables domain experts to use familiar tools to achieve the complex task of ontology building.

Populous is a tool for gathering content and populating ontologies. It builds on the RightField codebase and makes use of the ontology loading and spreadsheet display functionality. Like RightField, it is also built on the premise that a tabular, template view is a natural way of gathering data. In this case, it is used to separate the process of gathering content for the ontology from the process of ontology conceptualisation. The structure and axioms of the ontology are shielded from the user, who only sees the entities related to one another in the spreadsheet table. Consequently, Populous is most useful when a repeating ontology design pattern emerges that needs to be populated en masse.

Consider the OBO cell type ontology [34], which contains a basic design pattern stating that every cell type has a nucleation, where the nucleation is one of anucleate, nucleate, bi-nucleate or multinucleate. Populous can be used to create a simple template where column A captures the cell type and column B captures the nucleation value. RightField can be used to restrict the allowable values in column A to named cell types and those in column B to the set of valid nucleation terms. This template can either be exported and modified in Microsoft Excel, or users can choose to work directly inside the Populous RightField extension. Using Populous directly offers some additional benefits over standalone Excel, such as the ability to auto-complete and search for terms using wildcard string matches from within a spreadsheet cell. Figure 4 shows a populated template as it appears in the Populous extension to RightField. Data that match terms in the validation list are highlighted in green, whereas incorrect or unknown terms are shown in red. Users are free to choose whether to work only in Excel documents or to work on the same files directly in Populous to benefit from the advanced features it offers.
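The validation behaviour described above can be approximated in a few lines of Python. This is a sketch only and does not reflect how Populous or RightField are implemented: the cell type list is an invented subset, the function names are hypothetical, and the wildcard search simply uses the standard library fnmatch module.

```python
from fnmatch import filter as wildcard_filter

# Allowed values for each column. The cell types are an invented subset;
# the nucleation terms are those named in the text.
CELL_TYPES = {"erythrocyte", "hepatocyte", "osteoclast", "neutrophil"}
NUCLEATIONS = {"anucleate", "nucleate", "bi-nucleate", "multinucleate"}


def check_cell(value, allowed):
    """Return 'green' for a recognised term and 'red' otherwise."""
    return "green" if value.strip().lower() in allowed else "red"


def validate_template(rows):
    """Validate (column A, column B) pairs against the two term lists."""
    return [
        (check_cell(cell_type, CELL_TYPES), check_cell(nucleation, NUCLEATIONS))
        for cell_type, nucleation in rows
    ]


def search_terms(pattern, allowed):
    """Wildcard search over a term list, e.g. 'multi*'."""
    return wildcard_filter(sorted(allowed), pattern)


populated = [
    ("erythrocyte", "anucleate"),
    ("osteoclast", "multinucleate"),
    ("kidney cell", "nucleated"),  # neither value is in the allowed lists
]
statuses = validate_template(populated)
matches = search_terms("multi*", NUCLEATIONS)
```

Cells flagged 'red' are not necessarily mistakes; they may equally indicate a term that is missing from the source ontology.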


Figure 4. Populous, an ontology content population extension to RightField.


The template approach lowers the entry level for a user wishing to contribute to the ontology. Once the template is populated, the information can be translated into an OWL representation using OPPL, which comes packaged with the Populous extension. OPPL allows ontology design patterns to be specified in terms of OWL axioms that include variables for the differentia. Each row in the table represents a single instantiation of the pattern. The Populous OPPL wizard guides the user through the instantiation of a particular design pattern by binding columns in the table to variables in the axiom. The output of this process is a new OWL ontology along with a reproducible workflow that details how the ontology was built. Capturing knowledge using constrained templates improves the consistency with which the templates are populated. This reduces the amount of time required to validate the data and offers a way to automate the workflow for ontology construction. Building ontologies in this way means we capture the provenance of the ontology's construction, such as how each axiom was constructed and who generated the content. This separation of content from the axiom patterns offers new possibilities for refactoring of ontologies. Ontology developers can explore different conceptualisations from the same data captured in the template. A linked data application may only require a simple taxonomy from the data, which could be represented in SKOS, but if classification and reasoning were important, the design patterns could be substituted to build a more expressive ontology, both from the same template.
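The idea of binding columns to pattern variables, and of swapping patterns over the same populated template, can be sketched as follows. The sketch uses plain string substitution rather than OPPL, and it does not reproduce OPPL syntax; the patterns, the has_nucleation property, the column names and the contributor column are all assumptions made for illustration.

```python
import csv
import io

# Populated template as it might be exported from a spreadsheet (invented rows).
TEMPLATE_CSV = """cell_type,nucleation,contributor
erythrocyte,anucleate,alice
osteoclast,multinucleate,bob
"""

# Binding of pattern variables to template columns.
BINDINGS = {"?cell": "cell_type", "?nucleation": "nucleation"}

# Two interchangeable patterns over the same data (illustrative text only).
OWL_PATTERN = "Class: ?cell SubClassOf: cell and (has_nucleation some ?nucleation)"
SKOS_PATTERN = "?cell skos:broader cell"


def instantiate(pattern, bindings, row):
    """Replace each variable in the pattern with the bound column value."""
    axiom = pattern
    # Longest variable names first, so one name cannot clobber a longer one.
    for variable in sorted(bindings, key=len, reverse=True):
        axiom = axiom.replace(variable, row[bindings[variable]])
    return axiom


def build(pattern, bindings, csv_text):
    """Instantiate the pattern once per row, keeping simple provenance."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {
            "axiom": instantiate(pattern, bindings, row),
            "row": number,  # spreadsheet row; the header is row 1
            "contributor": row["contributor"],
        }
        for number, row in enumerate(reader, start=2)
    ]


owl_axioms = build(OWL_PATTERN, BINDINGS, TEMPLATE_CSV)
skos_axioms = build(SKOS_PATTERN, BINDINGS, TEMPLATE_CSV)
```

Because the populated template is untouched when the pattern changes, the simple taxonomy and the more expressive ontology remain traceable to the same rows and contributors.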

3.1 Populous in practice

RightField and the Populous extension are being used in the e-LICO project to develop ontologies relating to the KUP. The KUPO describes the kidney's cells, anatomy and diseases and is being used to capture metadata about experiments relating to renal physiology and disease. The KUPO re-uses many existing ontologies to build a much more detailed model of the kidney. Several design patterns for the KUPO were specified in templates built using RightField. Biologists with little or no prior knowledge of ontology development populated these templates using Populous. The design patterns were expressed in OPPL and used to generate the final KUPO. The templates, OPPL scripts and generated ontologies can be found at http://www.e-lico.eu/kupo.

Initial evaluations showed that RightField-enabled spreadsheets produced datasets with greater consistency of annotation and compliance with standards. It is possible to override the term selection and annotate cells with plain text, but users did not typically do so. Therefore, the annotation was consistent with the allowed terms from the ontology. If a suitable term could not be found, the default behaviour was to leave the parent class as the annotation, which also produces consistently annotated data.

For community ontologies, users are not able to contribute new term suggestions without entering a curation and review process, so this behaviour is desirable. For local ontologies (e.g. the SysMO-JERM and the KUPO), however, this mechanism could be used to identify and collect new terms by adding the Populous validation extension to RightField.

In Populous, if users add terms to marked-up cells that do not feature in the range of terms from the chosen ontology, Populous will highlight these. It could mean that these terms are errors, or it could mean that the relevant ontology needs to be extended with new terms. In each case, the tool assists with the curation process, identifying areas that require further investigation. In the development of KUPO, the biologists identified several gaps in the existing anatomy and cell type ontologies. Over 200 new terms can now be extracted from the populated template for candidate submission to the respective ontology development projects.

3.2 Related work

The ExcelImporter plugin for Protégé 4 and the MappingMaster plugin for Protégé 3 [31] generate OWL axioms from spreadsheet content. These and RDF conversion tools such as Excel2RDF (http://www.mindswap.org/~rreck/excel2rdf.shtml), XLWrap [35] and RDF123 [36] focus on the transformation of spreadsheet content but pay little attention to how the data are collected in those spreadsheets. When generating ontologies in this way, a large portion of time is taken to ‘clean up’ and validate the populated spreadsheets. The template to which users of such spreadsheets have to comply is implicit, rather than being exposed as a collection of constrained entries to which the user complies. RightField and the Populous extension enable the validation of input data at the source and bridge the gap between the population and transformation of these spreadsheets. Their explicit constraints make it easier for the user to find the terms they need for annotating data and make the task of transformation easier, all in a stealthy form by using the very tools that the users deal with in their everyday tasks.

4 DISCUSSION


Many experimental biologists have no interest or experience in the use of ontologies but understand the value of semantic annotation and quality metadata when attempting to reuse data from others. For projects such as SysMO and e-LICO, the overall aims of the project include sharing and dissemination.

The more functionality that can be embedded in tools already in use, the more likely that semantic annotation will be considered part of the data management process. There are already a plethora of terminologies and information models available to support the consistent annotation of data, but providing such schemas is only the first step. It should not be an extra task for the data providers to describe their data in such a way that it can be interpreted and reused. This means that simple methods for adding semantics need to be embedded in the data management process.

RightField and Populous are flexible and extensible curation tools that make semantic annotation easier and more accessible to researchers: hence, annotation ‘by stealth’. They bridge the gap between informaticians and data producers as data can be annotated accurately and at source by the laboratory scientists who can also contribute terms to ontologies when needed. The advantages are the following:

  • Reduction in errors and the time it takes to annotate the data to comply with community standards;
  • Greater understanding through choice restrictions of annotation terms to a small and manageable set;
  • Greater validation to ensure data annotation quality and provide more feedback during annotation; and
  • Community intelligence for information model and ontology developers to incrementally improve the coverage and structure of the metadata standards they propose. The combination of the tools provides a convenient platform for prototyping and piloting data collection standards by trying them out to collect data, designing models through use. A similar approach was pioneered by the PEDRo toolkit [37].

RightField and Populous use a simple, yet powerful and generic approach. The tools are already in use in large biological initiatives, but there is nothing in the applications that makes them specific to the biological domain. RightField is already under evaluation for use with datasets in the oil and gas industry. Any domain with heterogeneous data that requires annotation to common schemas and ontologies could use them.

Excel2RDF, XLWrap and RDF123 enable the generation of RDF statements from spreadsheet data. RightField and Populous lay the foundations for the structured extraction of data and knowledge and present the possibility of contributing to Open Linked Data. Ongoing research will focus on extracting spreadsheet data into RDF for storage and comparison to fully exploit the large corpus of data that is being generated in the SysMO consortium and in the wider systems biology community.
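As an indication of what such an extraction might look like, the following Python sketch serialises annotated spreadsheet rows as N-Triples. It does not describe Excel2RDF, XLWrap, RDF123 or any planned RightField feature; the namespace http://example.org/, the predicate names and the rows are hypothetical, and a real extraction would use the URIs of the ontology terms selected in the spreadsheet.

```python
from urllib.parse import quote

BASE = "http://example.org/"  # hypothetical namespace, for illustration only


def to_ntriples(rows, predicate_by_column, base=BASE):
    """Emit one N-Triples statement per annotated cell."""
    lines = []
    for row in rows:
        subject = f"<{base}sample/{quote(row['id'])}>"
        for column, predicate in predicate_by_column.items():
            term = f"<{base}term/{quote(row[column])}>"
            lines.append(f"{subject} <{base}{predicate}> {term} .")
    return lines


# Invented rows standing in for an annotated spreadsheet.
annotated_rows = [
    {"id": "S1", "cell_type": "erythrocyte", "nucleation": "anucleate"},
    {"id": "S2", "cell_type": "osteoclast", "nucleation": "multinucleate"},
]
column_predicates = {"cell_type": "hasCellType", "nucleation": "hasNucleation"}

triples = to_ntriples(annotated_rows, column_predicates)
print("\n".join(triples))
```

Because the cell values were constrained to ontology terms when the data were entered, each value maps directly to a resource without a separate clean-up step.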

REFERENCES

  • 1
    Howe D, Costanzo M, Fey P, Gojobori T, Hannick L, Hide W, Hill DP, Kania R, Schaeffer M, St Pierre S, Twigger S, White O, Rhee SY. Big data: the future of biocuration. Nature 2008; 455:47–50.
  • 2
    Bateman A. Curators of the world unite: the International Society of Biocuration. Bioinformatics 2010; 26:991.
  • 3
    St Pierre S, McQuilton P. Inside FlyBase: biocuration as a career. Fly (Austin) 2009; 3:112–114.
  • 4
    Sansone SA, Rocca-Serra P, Field D, Maguire E, Taylor C, Hofmann O, Fang H, Neumann S, Tong W, Amaral-Zettler L, Begley K, Booth T, Bougueleret L, Burns G, Chapman B, Clark T, Coleman LA, Copeland J, Das S, de Daruvar A, de Matos P, Dix I, Edmunds S, Evelo CT, Forster MJ, Gaudet P, Gilbert J, Goble C, Griffin JL, Jacob D, Kleinjans J, Harland L, Haug K, Hermjakob H, Ho Sui SJ, Laederach A, Liang S, Marshall S, McGrath A, Merrill E, Reilly D, Roux M, Shamu CE, Shang CA, Steinbeck C, Trefethen A, Williams-Jones B, Wolstencroft K, Xenarios I, Hide W. Toward interoperable bioscience data. Nature Genetics 2012; 44:121–126.
  • 5
    Taylor CF, Field D, Sansone SA, Aerts J, Apweiler R, Ashburner M, Ball CA, Binz PA, Bogue M, Booth T, Brazma A, Brinkman RR, Michael Clark A, Deutsch EW, Fiehn O, Fostel J, Ghazal P, Gibson F, Gray T, Grimes G, Hancock JM, Hardy NW, Hermjakob H, Julian RK Jr, Kane M, Kettner C, Kinsinger C, Kolker E, Kuiper M, Le Novère N, Leebens-Mack J, Lewis SE, Lord P, Mallon AM, Marthandan N, Masuya H, McNally R, Mehrle A, Morrison N, Orchard S, Quackenbush J, Reecy JM, Robertson DG, Rocca-Serra P, Rodriguez H, Rosenfelder H, Santoyo-Lopez J, Scheuermann RH, Schober D, Smith B, Snape J, Stoeckert CJ Jr, Tipton K, Sterk P, Untergasser A, Vandesompele J, Wiemann S. Promoting coherent minimum reporting guidelines for biological and biomedical investigations: the MIBBI project. Nature Biotechnology 2008; 26:889–896.
  • 6
    Noy NF, Shah NH, Whetzel PL, Dai B, Dorf M, Griffith N, Jonquet C, Rubin DL, Storey MA, Chute CG, Musen MA. BioPortal: ontologies and integrated data resources at the click of a mouse. Nucleic Acids Research 2009; 37:W170–173.
  • 7
    Brinkman RR, Courtot M, Derom D, Fostel JM, He Y, Lord P, Malone J, Parkinson H, Peters B, Rocca-Serra P, Ruttenberg A, Sansone SA, Soldatova LN, Stoeckert CJ Jr, Turner JA, Zheng J, OBI consortium. Modeling biomedical experimental processes with OBI. Journal of Biomedical Semantics 2010; 1(Suppl 1):S7.
  • 8
    Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genetics 2000; 25:25–29.
  • 9
    Rayner TF, Rocca-Serra P, Spellman PT, Causton HC, Farne A, Holloway E, Irizarry RA, Liu J, Maier DS, Miller M, Petersen K, Quackenbush J, Sherlock G, Stoeckert CJ Jr, White J, Whetzel PL, Wymore F, Parkinson H, Sarkans U, Ball CA, Brazma A. A simple spreadsheet-based, MIAME-supportive format for microarray data: MAGE-TAB. BMC Bioinformatics 2006; 7:489.
  • 10
    Rayner TF, Rezwan FI, Lukk M, Bradley XZ, Farne A, Holloway E, Malone J, Williams E, Parkinson H. MAGETabulator, a suite of tools to support the microarray data format MAGE-TAB. Bioinformatics 2009; 25:279–280.
  • 11
    Bodenreider O, Stevens R. Bio-ontologies: current trends and future directions. Briefings in Bioinformatics 2006; 7:257–274.
  • 12
    Rector A. Modularisation of domain ontologies implemented in description logics and related formalisms including owl. Proceedings of the 2nd International Conference on Knowledge Capture, Sanibel Island, USA, 2003.
  • 13
    Wolstencroft K, Owen S, Horridge M, Krebs O, Mueller W, Snoep JL, du Preez F, Goble C. RightField: embedding ontology annotation in spreadsheets. Bioinformatics 2011; 27:2021–2022.
  • 14
    Jupp S, Horridge M, Iannone L, Klein J, Owen S, Schanstra J, Wolstencroft K, Stevens R. Populous: a tool for building OWL ontologies from templates. BMC Bioinformatics 2012; 13(Suppl 1):S5.
  • 15
    Egana M, Rector A, Stevens R, Antezana A. Applying ontology design patterns in bio-ontologies. Knowledge Engineering: Practice and Patterns, Proceedings 2008; 5268:7–16.
  • 16
    Iannone L, Rector A, Stevens R. Embedding knowledge patterns into OWL. Semantic Web: Research and Applications 2009; 5554:218–232.
  • 17
    Jupp S, Klein J, Schanstra J, Stevens R. Developing a kidney and urinary pathway knowledge base. Journal of Biomedical Semantics 2011; 2(Suppl 2):S7.
  • 18
    Atkinson M, De Roure D, van Hemert J, Michaelides D. Shaping ramps for data-intensive research. UK e-Science All Hands Meeting, Cardiff, UK, 2010.
  • 19
    Müller W, Krebs O, Rojas I, Wolstencroft K, Owen S, Alexejevs S, Goble C, Snoep J. SysMO-DB: just enough exchange for systems biology data and models. Microsoft Research eScience Workshop, Pittsburgh, USA, 2009.
  • 20
    Wolstencroft K, Owen S, du Preez F, Krebs O, Mueller W, Goble C, Snoep JL. The SEEK: a platform for sharing data and models in systems biology. Methods in Enzymology 2011; 500:629–655.
  • 21
    Whetzel PL, Parkinson H, Causton HC, Fan L, Fostel J, Fragoso G, Game L, Heiskanen M, Morrison N, Rocca-Serra P, Sansone SA, Taylor C, White J, Stoeckert CJ Jr. The MGED ontology: a resource for semantics-based description of microarray experiments. Bioinformatics 2006; 22:866–873.
  • 22
    Horridge M, Bechhofer S. The OWL API: A Java API for OWL ontologies. Semantic Web 2010. DOI: 10.1.1.163.7035
  • 23
    Sansone SA, Rocca-Serra P, Brandizi M, Brazma A, Field D, Fostel J, Garrow AG, Gilbert J, Goodsaid F, Hardy N, Jones P, Lister A, Miller M, Morrison N, Rayner T, Sklyar N, Taylor C, Tong W, Warner G, Wiemann S, Members of the RSBI Working Group. The first RSBI (ISA-TAB) workshop: “can a simple format work for complex studies?” Omics 2008; 12:143–149.
  • 24
    Parkinson H, Sarkans U, Shojatalab M, Abeygunawardena N, Contrino S, Coulson R, Farne A, Lara GG, Holloway E, Kapushesky M, Lilja P, Mukherjee G, Oezcimen A, Rayner T, Rocca-Serra P, Sharma A, Brazma A, Sansone S. ArrayExpress--a public repository for microarray gene expression data at the EBI. Nucleic Acids Research 2005; 33:D553–555.
  • 25
    Vizcaino JA, Reisinger F, Côté R, Martens L. PRIDE: data submission and analysis. Current Protocols in Protein Science 2010, Chapter 25: Unit 25 24.
  • 26
    Wolstencroft K, Owen S, Goble C, Nguyen Q, Krebs O, Müller W. RightField: semantic enrichment of systems biology data using spreadsheets. IEEE International Conference on e-Science, Chicago, USA, 2012.
  • 27
    Rocca-Serra P, Brandizi M, Maguire E, Sklyar N, Taylor C, Begley K, Field D, Harris S, Hide W, Hofmann O, Neumann S, Sterk P, Tong W, Sansone SA. ISA software suite: supporting standards-compliant experimental annotation and enabling curation at the community level. Bioinformatics 2010; 26:2354–2356.
  • 28
    Cote R, Reisinger F, Martens L, Barsnes H, Vizcaino JA, Hermjakob H. The Ontology Lookup Service: bigger and better. Nucleic Acids Research 2010; 38:W155–160.
  • 29
    Jupp S, Horridge M, Iannone L, Klein J, Owen S, Schanstra J, Stevens R, Wolstencroft K. A tool for populating ontology templates. Semantic Web Applications and Tools for Life Sciences (SWAT4LS), Berlin, 2010.
  • 30
    Suarez-Figueroa MC, Gomez-Perez A. NeOn methodology for building ontology networks: a scenario-based methodology. Proceedings of the International Conference on SOFTWARE, SERVICES & SEMANTIC TECHNOLOGIES (S3T 2009), 2009.
  • 31
    O'Connor MJ, Halaschek-Wiener C, Musen MA. Mapping master: a flexible approach for mapping spreadsheets to OWL. Proceedings of the 9th International Semantic Web Conference on The Semantic Web, Shanghai, China, 2010; 194–208.
  • 32
    Rocca-Serra P, Ruttenberg A, O'Connor MJ, Whetzel T, Schober D, Greenbaum J, Courtot M, Sansone SA, Scheurmann R, Peters B. Overcoming the ontology enrichment bottleneck with quick term templates. International Conference on Biomedical Ontology, 2009.
  • 33
    Brazma A, Hingamp P, Quackenbush J, Sherlock G, Spellman P, Stoeckert C, Aach J, Ansorge W, Ball CA, Causton HC, Gaasterland T, Glenisson P, Holstege FC, Kim IF, Markowitz V, Matese JC, Parkinson H, Robinson A, Sarkans U, Schulze-Kremer S, Stewart J, Taylor R, Vilo J, Vingron M. Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nature Genetics 2001; 29:365–371.
  • 34
    Bard J, Rhee SY, Ashburner M. An ontology for cell types. Genome Biology 2005; 6:R21.
  • 35
    Langegger A, Woss W. XLWrap - querying and integrating arbitrary spreadsheets with SPARQL. Semantic Web - ISWC 2009 Proceedings 2009; 5823:359–374.
  • 36
    Han LS, Finin T, Parr C, Sachs J, Joshi A. RDF123: from spreadsheets to RDF. Semantic Web - ISWC. Lecture Notes in Computer Science 2008; 5318:451–466.
  • 37
    Garwood K, McLaughlin T, Garwood C, Joens S, Morrison N, Taylor CF, Carroll K, Evans C, Whetton AD, Hart S, Stead D, Yin Z, Brown AJ, Hesketh A, Chater K, Hansson L, Mewissen M, Ghazal P, Howard J, Lilley KS, Gaskell SJ, Brass A, Hubbard SJ, Oliver SG, Paton NW. PEDRo: a database for storing, searching and disseminating experimental proteomics data. BMC Genomics 2004; 5:68.