
MyExperimentalScience, extending the ‘workflow’

Correspondence to: Jeremy G. Frey, Chemistry, University of Southampton, Southampton, SO17 1BJ, UK.

E-mail: j.g.frey@soton.ac.uk

SUMMARY

Science, especially experimental science, has always depended on the careful capture of plans, actions, raw and processed data and conclusions. With scientific research now so inextricably dependent on computers, the use of an electronic laboratory notebook (ELN) is almost essential. The meticulous notebooks of Michael Faraday and other scientists of his era remain as role models for the recording that is necessary, but they cannot provide the essential support for discussion, sharing, collaboration and formal verification. A blog (a contraction of Web log) can form the basis for implementing an electronic notebook but does not suffice to meet all the needs of an ELN. This paper describes the LabTrove ELN, which is blog based but provides numerous additional features, such as version control, security policies and a flexible metadata scheme, and facilities for interchanging objects with other systems. The MyExperimentalScience project links LabTrove with myExperiment, a repository for workflows in a collaborative environment, thereby making LabTrove templates available for discovery and reuse. Collaboration, sharing and reuse are essential for scientific progress, which depends on individual scientists building on the results already produced by others. Open-source ELNs such as LabTrove are ideal vehicles to support the growth of Open Notebook Science. Copyright © 2012 John Wiley & Sons, Ltd.

1 INTRODUCTION

Experimental science comprises a range of activities: exploring ideas, recording conjectures, planning, setting up equipment, conducting tests, recording observations, analysing results and publishing findings. Philosophers have attempted to define these tasks as the scientific method, although in doing so they have never questioned the importance of recording. Significantly, as Poulter points out [1]:

The integrity of science as a discipline rests on the ability of scientists to reproduce the claims of others.

Results should always be open to testing and be capable of replication. The journal Organic Syntheses requires each procedure to be reproduced, as described, in an independent laboratory before publication. Although it is rare indeed to go to such lengths, failure to respect the creed of science can sometimes give rise to public concern, and embarrassment for those involved. The issue is not so much deliberate fraud as the need to deal adequately with the real difficulties of scientific research: reproducing experiments can be hard and results can be open to interpretation, but publishing full details, as far as possible, is the best way to advance science. This means process as well as data.

A recent example is the so-called ClimateGate affair [2], which arose from the publication of emails and other documents that researchers at the Climatic Research Unit, University of East Anglia, had thought were private. The BBC News item was one of many media articles exploring the consequential demands for greater public access to research data, for scientists to ‘show their working’ to the public as well as when subject to peer review. Of course, ‘show your working’ was a very commonly used phrase by our teachers, so there is nothing new here!

The scientists of old, such as Michael Faraday, were scrupulous about recording their experiments [3]:

Faraday's laboratory notebooks are also remarkable in the amount of detail that they give about the design and setting up of experiments, interspersed with comments about their outcome and thoughts of a more philosophical kind. All are couched in plain language, with many vivid phrases of delightful spontaneity…

In this modern era, with science now so dependent on computers, the quantities of data and other information are vastly greater. It is vital to maintain links between these objects: paper no longer suffices; the laboratory notebook has of sheer necessity had to become electronic. However, we must adapt the notebook without losing the art of keeping a full, accurate and informative record.

This paper describes the MyExperimentalScience project, which links the LabTrove electronic laboratory notebook (ELN) [4] with myExperiment, a platform that provides a repository for workflows in a collaborative environment [5]. A workflow is an explicit and precise description of a scientific process. LabTrove preserves the journal paradigm while implementing the features essential for capturing provenance and interchanging data and methods with other systems, such as myExperiment.

In recent years, there has been much activity in the preservation and dissemination of in silico experiments through the sharing of workflows (for example, Wilkinson et al. [6]). Frey and Bird have reviewed the principles of Web services and workflows in the context of drug discovery ([7] and references therein). In the context of this paper, the term workflow applies to the experimental steps performed by a scientist in vitro. In the past, these ‘real’ workflows would have been recorded in a paper laboratory notebook, so the only ways to share the procedure were to write a journal paper about the procedure or to expose full pages of the laboratory notebook. LabTrove enables the sharing and reuse of procedures, workflows and other information of common interest. UsefulChem [8] and Open WetWare [9] are other examples of initiatives with the same goals.

2 WHY ELECTRONIC LABORATORY NOTEBOOKS

The processes of experimenting, and of research in general, depend strongly on maintaining an accurate record of the steps taken and the stages of progress. Although the methods and techniques used vary considerably from one discipline to another, the significance of the record of intentions, practice and results is common to all research endeavours.

Whether the researcher is motivated by testing a hypothesis or by the careful collection of data to support an explanation of an observed phenomenon, the record is an essential support for discussion, sharing, collaboration and formal verification. It is now standard practice for funding agencies to require all the outputs of research to be discoverable and available for repurposing and reuse in follow-on work. Progress depends on individual scientists building on the results already produced by others.

In our view, a laboratory notebook should include thoughts and ideas as well as a full record of all plans, actions, raw and processed data and conclusions. However, this view of the nature of laboratory notebooks is not universal and is discipline dependent.

The case for the ELN is almost irrefutable. Maintaining records electronically simplifies both the discovery and exchange of information: investigators can search the full record for information of all types, from text to domain-specific data, such as chemical structures. All research generates data, often in large quantities, which the electronic record can link in both raw and processed forms. Importantly, the record can also link to the procedures, methods and workflows used. The ELN provides ready access to the data and the methods, thereby enhancing reuse and collaboration, which can be global. The value of electronic methods in the modern laboratory has been explored thoroughly by Frey [10, 11].

These characteristics are also the foundation for Open Notebook Science. Todd et al. argued cogently that their open-source approach, using the LabTrove ELN, was crucial to their research being accelerated [12]. They also discussed the other advantages accruing from openness.

Understanding and working with large data volumes is difficult without appropriate metadata or, to be precise, descriptive metadata. It is particularly important to capture—with metadata—the process by which scientific data are obtained. The true value of such metadata is realised only when data are stored and managed electronically. Metadata can and do exist with traditional forms of recording—indexes are the obvious example—but techniques such as filtering and faceted search are feasible only for electronic data. The metadata are also significant when data are exported to other systems (such as myExperiment).

In practice, ELNs range from electronic journals that are effectively paper replacements through integrated corporate systems that maintain audit trails and verifiable time stamps that can be used to protect intellectual property. A full discussion of the purposes and implementation of ELNs is beyond the scope of this paper, which is concerned with intermediate approaches that provide flexible forms of recording while maintaining control over provenance and security.

3 WHY BLOGS

A blog (a contraction of Web log) comprises a series of website entries, described as posts, usually in reverse chronological order. Although posts are commonly textual, they can also include images and other material. Blogs are often regarded as social networking tools.

In the context of laboratory notebooks, it is easy to relate the journal nature of a blog with the traditional written record. Blogs provide the basic functionality required for laboratory recording and for scientific discourse: investigators can imbue their entries with personal style; they can share information in the form of posts, data and other files; they can develop an informal classification scheme to aid future discovery and retrieval. One can imagine how the famous scientists of the past, such as Michael Faraday, might have adapted to the ‘blogosphere’.

Despite their manifest attributes, blogs have limitations. Although blogging is used to further the social aspects of science, blogs are rarely used for the science itself. However, the UsefulChem project [8, 13] is one example of an Open Notebook Science initiative that uses wikis and blogs. In general, the use of blogs tends to be restricted by concerns about the integrity of data and of the posts themselves, arising from a lack of version control and appropriate security provisions, or access control. Although free access can promote collaboration, the pursuit of knowledge sometimes requires access to be restricted.

Thus, although blogs do provide the basis for flexible recording in a laboratory context, without additional features, they do not suffice to meet the needs of an ELN. Providing the necessary features was the motivation for developing LabTrove.

4 WHY LabTrove

Version control is an essential feature of LabTrove. Records in paper notebooks almost inevitably contain corrections and other modifications, but these are clearly visible. User interfaces usually display only the most recent version of an electronic record. LabTrove allows users to edit their own posts while maintaining a complete change history. All edits must be annotated with a reason for the change, and each version bears a date and time stamp that is protected and cannot itself be modified. It is therefore possible to trace back through the individual versions of a post, thus preserving its provenance.
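The behaviour described above can be illustrated with a minimal sketch in Python. It is not the LabTrove implementation: the class and method names (`Post`, `Version`, `edit`, `history`) are hypothetical, chosen only to show a change history in which every edit carries a reason and a time stamp that cannot be altered afterwards.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Version:
    """One revision of a post; frozen, so its time stamp cannot be modified."""
    content: str
    reason: str
    timestamp: datetime


@dataclass
class Post:
    """A post that keeps every revision instead of overwriting the last one."""
    title: str
    versions: list = field(default_factory=list)

    def edit(self, content: str, reason: str) -> Version:
        # Every change must be annotated with a reason.
        if not reason.strip():
            raise ValueError("an edit must be annotated with a reason")
        version = Version(content, reason, datetime.now(timezone.utc))
        self.versions.append(version)
        return version

    @property
    def current(self) -> str:
        # User interfaces usually display only the most recent version.
        return self.versions[-1].content

    def history(self):
        # Oldest first: (version number, reason, content).
        return [(n, v.reason, v.content)
                for n, v in enumerate(self.versions, start=1)]


post = Post("PCR run 1")
post.edit("Annealing at 55 C", reason="initial record")
post.edit("Annealing at 58 C", reason="corrected transcription error")

for number, reason, content in post.history():
    print(number, reason, content)
```

Storing each revision as a separate immutable record, rather than updating a single record in place, is what allows a reader to trace back through the versions of a post.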

LabTrove also implements security and access control policies, ranging from all users being able to edit all items to all forms of access being password and account controlled.

Experiments generate data, often in large quantities, which are stored in files and repositories. LabTrove posts relating to the raw and processed forms of those data incorporate links to the storage location. Such links are expressed as uniform resource identifiers, as are references to other posts, thus forming the basis for the one-item-one-post principle on which LabTrove is founded.

The following items comprise a non-exhaustive list of what a post can be about: data, a material, an observation, a plan, a procedure and a sample. In each case, the item is represented by a uniform resource identifier, and the links can be followed to explore relationships and, for example, to understand the provenance of a conclusion.
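To make the idea of following links concrete, the sketch below treats a notebook as a mapping from each post's URI to the URIs of the posts it refers to, and walks that mapping to collect everything a conclusion depends on. The URIs, the `links` structure and the `provenance` function are assumptions made for illustration; they do not reflect how LabTrove stores its links.

```python
# Hypothetical URIs: each post lists the URIs of the posts it was derived from.
links = {
    "http://example.org/notebook/4": ["http://example.org/notebook/3"],   # conclusion
    "http://example.org/notebook/3": ["http://example.org/notebook/1",
                                      "http://example.org/notebook/2"],   # data
    "http://example.org/notebook/2": [],                                  # procedure
    "http://example.org/notebook/1": [],                                  # sample
}


def provenance(uri, links):
    """Return every post reachable from uri, in the order first visited."""
    seen = []
    stack = [uri]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # a post may be linked from several places
        seen.append(current)
        # reversed() so that links are visited in the order they were listed
        stack.extend(reversed(links.get(current, [])))
    return seen


trail = provenance("http://example.org/notebook/4", links)
print(trail)
```

Because every item has its own post and its own URI, the trail for a conclusion is simply the set of posts reachable from it.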

Attention to provenance is fundamental to the design of LabTrove. Investigators can follow links between items and thereby discover not only the process that led to a specific item but also how LabTrove represents that process. This model is in keeping with the spirit of the architecture described by the Provenance Project [14].

LabTrove is not intended to be a stand-alone tool. It incorporates a plug-in architecture to enable a principled approach to embedding content from external services, and it also includes an RSS feed. This feed mechanism enables data reuse by other applications and provides for notification and automated tracking. The foregoing outline is intended to provide only a summary of LabTrove functionality: a full description will appear in a paper about the use of LabTrove for electronic recording in Chemistry laboratories, which is currently being prepared by Frey, Milsted and Neylon. The purpose of this paper is to examine the role of LabTrove in a collaborative environment that depends upon the open sharing and reuse of workflows and experimental plans.
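As an illustration of how another application might use a feed for notification and tracking, the following sketch parses an RSS 2.0 document with the Python standard library and reports the items it has not seen before. The feed content is invented for the example and the function name `new_items` is hypothetical; only the `rss/channel/item` layout, with `title` and `link` elements, is taken from the RSS 2.0 format itself.

```python
import xml.etree.ElementTree as ET

# Invented feed content, laid out as RSS 2.0.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example notebook</title>
    <link>http://example.org/notebook</link>
    <description>Hypothetical feed for illustration</description>
    <item>
      <title>PCR product 1</title>
      <link>http://example.org/notebook/2</link>
    </item>
    <item>
      <title>PCR reaction</title>
      <link>http://example.org/notebook/1</link>
    </item>
  </channel>
</rss>"""


def new_items(feed_xml, already_seen):
    """Return (title, link) for each feed item whose link is not yet known."""
    root = ET.fromstring(feed_xml)
    fresh = []
    for item in root.findall("./channel/item"):
        link = item.findtext("link")
        if link not in already_seen:
            fresh.append((item.findtext("title"), link))
    return fresh


seen = {"http://example.org/notebook/1"}
for title, link in new_items(SAMPLE_FEED, seen):
    print("new post:", title, link)
```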

5 METADATA AND TEMPLATES

Maintaining a full record is just one aspect of managing an ELN: other activities, such as analysis, exchange and dissemination, depend strongly on reliable metadata. An experimental record is arguably incomplete unless the required metadata have been captured.

Metadata capture at source is clearly more efficient than relying on a subsequent curation process. However, for the metadata to be reliable, their acquisition should not become a burden for the user of an ELN. Moreover, the extent of the meta-information required should be no more than necessary: prescriptive form filling has to be avoided.

LabTrove meets these general needs with a flexible and extensible system for metadata capture and deployment. The one-item-one-post principle ensures that each post retains its individual characteristics, which are represented in the value assigned to the Section key, which is the only mandatory item of metadata. All other metadata are optional, in the form of key-value pairs. Individual users can organise their metadata scheme to suit the specific objectives of their project, while larger projects have the option to introduce a classification model to facilitate analysis and other activities that depend on being able to filter posts according to need. Figure 1 shows an example of the use of the LabTrove system for a biological chemistry experiment on neutral drift from the Neylon group [15].
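The metadata scheme just described, one mandatory Section key plus optional key-value pairs, can be sketched as follows. This is an illustration under stated assumptions rather than LabTrove code: the helper names `make_post` and `filter_posts`, and the example keys `technique` and `state`, are hypothetical.

```python
def make_post(title, section, **optional):
    """Build a post record; Section is the only mandatory metadata key."""
    if not section:
        raise ValueError("every post needs a value for the Section key")
    return {"title": title, "metadata": {"Section": section, **optional}}


def filter_posts(posts, criteria):
    """Keep the posts whose metadata match every key-value pair in criteria."""
    return [
        post for post in posts
        if all(post["metadata"].get(key) == value
               for key, value in criteria.items())
    ]


notebook = [
    make_post("PCR for shuffling", "Procedure", technique="PCR"),
    make_post("PCR product 1", "Sample", technique="PCR", state="crude"),
    make_post("PCR product 2", "Sample", technique="PCR", state="purified"),
    make_post("Weekly summary", "Notes"),  # no optional metadata at all
]

samples = filter_posts(notebook, {"Section": "Sample"})
purified = filter_posts(notebook, {"Section": "Sample", "state": "purified"})
print([post["title"] for post in samples])
print([post["title"] for post in purified])
```

Keeping the optional metadata as free key-value pairs lets a single user start with almost no structure, while a larger project can agree on a set of keys and filter on them.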

Figure 1.

An example of the use of the LabTrove blog system as an electronic laboratory notebook for an Open Notebook Science biological chemistry research project on neutral drift [12]. The blog is available at http://blogs.chem.soton.ac.uk/neutral_drift [15]. (a) Part of the timeline view of the blog posts, in which the titles link to the posts themselves (http://blogs.chem.soton.ac.uk/neutral_drift/timeline.html); (b) part of a blog post summarising a set of PCR experiments, with links to the products and purified product (http://blogs.chem.soton.ac.uk/neutral_drift/13719/PCR_for_DNA_shuffling_502589.html); (c) one of the product posts (http://blogs.chem.soton.ac.uk/neutral_drift/13705/PCR_for_shuffling_502589_product_1.html); and (d) one of the purified product posts (http://blogs.chem.soton.ac.uk/neutral_drift/13706/PCR_for_shuffling_502589_product_2.html).

When a project has adopted a metadata scheme, it will be important to present users with input fields appropriate to the metadata keys required. Furthermore, for ease of curation, the collection and recording of data and metadata should, insofar as is feasible, be automated. For these reasons, LabTrove provides templates.

The use of templates encourages consistency and facilitates the acquisition of metadata that are standard for specific types of post. Tables provide a simple but powerful illustration of the value of templates. The normal text entry mode for blogs is cumbersome when introducing tables, yet these are the natural choice for presenting a set of results. For example, when a series of reactions is carried out in parallel, a table is the natural way to present the input materials and to identify the products of each reaction. Figure 2 shows the use of the template to help collate a series of PCR experiments and link the reactants, reactions and products together. The template helps to encourage this structured approach to the records.

Figure 2.

(a) An example of a template used in the neutral drift project blog (http://blogs.chem.soton.ac.uk/neutral_drift/4200/DNA_Shuffling_Template__part_1_DNaseI_digestion.html); (b) a stage in using the template illustrating the dropdown menu with the list consisting of the blog posts that match the metadata selection from the table column; and (c) the completed post (http://blogs.chem.soton.ac.uk/neutral_drift/13719/PCR_for_DNA_shuffling_502589.html).

The LabTrove template system can use metadata key-value pairs to populate rendered templates with appropriate posts representing related objects. The effective use of templates depends on a consistent and organised approach to metadata structuring. To capture process details easily, metadata should map closely to how those processes work in practice. When the templates are created and used, the dropdown lists reflect the metadata items, encouraging their use. Figure 2 illustrates the creation of a post from a template, in which the cells in the second column contain a metadata key value. LabTrove uses that value to produce a list of sample posts, one of which the researcher selects.
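The way a template cell can turn a metadata key-value pair into a dropdown list of candidate posts is sketched below. The data layout and the names `template_rows`, `dropdown_options` and `render` are assumptions made for this example; the actual LabTrove template syntax is not shown here.

```python
# Posts already in the notebook (invented examples).
posts = [
    {"title": "Template DNA 502589", "metadata": {"Section": "Sample", "material": "DNA"}},
    {"title": "Template DNA 502590", "metadata": {"Section": "Sample", "material": "DNA"}},
    {"title": "Taq polymerase lot 7", "metadata": {"Section": "Sample", "material": "Polymerase"}},
    {"title": "Group meeting notes", "metadata": {"Section": "Notes"}},
]

# Each table row of the template names the metadata key and value to match.
template_rows = [
    {"label": "Template", "key": "material", "value": "DNA"},
    {"label": "Enzyme", "key": "material", "value": "Polymerase"},
]


def dropdown_options(row, posts):
    """Titles of the posts whose metadata carry the row's key-value pair."""
    return [
        post["title"] for post in posts
        if post["metadata"].get(row["key"]) == row["value"]
    ]


def render(template_rows, posts):
    """Map each row label to the list the researcher would choose from."""
    return {row["label"]: dropdown_options(row, posts) for row in template_rows}


form = render(template_rows, posts)
for label, options in form.items():
    print(label, "->", options)
```

The sketch also shows why consistent metadata matter: a sample post recorded without the expected key would simply not appear in the list.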

6 LabTrove AND myExperiment

The LabTrove ELN embodies the one-item-one-post principle, which enables all types of data to be viewed and integrated, and is the feature on which the preservation of provenance depends. LabTrove also implements a flexible metadata scheme, with the ultimate aim that its metadata be capable of interpretation by both humans and machines.

Another important function of any laboratory record is to capture the procedures, methods and workflows used. With traditional paper notebooks, a description of the method would be transferred to a published paper to enable other researchers to replicate the procedure. The scientific literature as it stands rarely, if ever, provides sufficient detail to replicate a published study. With ELNs, methods can be made available in a variety of electronic forms, particularly as workflows, which are explicit and precise descriptions of scientific processes. From a scientist's viewpoint, workflows are a flexible and reliable way of automating a scientific method, enabling the sharing and reuse of the techniques embodied.

In common with other information that is made available for sharing, reuse or repurposing, workflows must be discoverable and accessible, which implies the need for repositories and exchange mechanisms. The largest public repository of scientific workflows is now myExperiment [5].

However, myExperiment offers more than workflows to the researcher. It is a repository of research objects, such as documents and media files, as well as methods; it also acts as a community hub for scientists who want to share information and ideas. In addition to providing a platform for running scientific services and linking scientific resources, myExperiment offers a social virtual research environment.

The myExperiment project is part of the myGrid project [16], which is also responsible for the Taverna Workflow Workbench [17-19] for creating and running scientific workflows, as illustrated on the website.

LabTrove is not alone in enabling the sharing and reuse of procedures, workflows and other information of common interest. Other examples are UsefulChem [8] and Open WetWare [9]. As well as being a source of procedures and methods, LabTrove can itself function as a workflow enactor. For example, a template can be constructed to lead the experimenter through the stages required to complete a given procedure. Moreover, by using its API, LabTrove could be driven from a Taverna workflow.

The MyExperimentalScience project links the LabTrove ELN with the myExperiment platform, using its collaborative features and thus enabling a wider community to discover and reuse LabTrove templates. To achieve this link, two aspects of the LabTrove export were combined. LabTrove can export individual posts as XML and as a PNG image; for integrating LabTrove templates with myExperiment, both types of export are required.

The exported XML from a template includes the code required to generate the template form, such as table definitions; the metadata keys with prompts for the nature of the values expected; and packaging information that characterises the template post, such as its title and author. Examples of the code used to describe tables and metadata are provided in the LabTrove documentation [4], and the syntax will be fully discussed in the paper, mentioned earlier, about electronic recording in Chemistry laboratories.
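To indicate how a receiving system could make use of such a package, the sketch below reads a title, an author and a list of metadata keys with their prompts from an XML document. The element and attribute names (`post`, `title`, `author`, `metadata`, `key`, `name`, `prompt`) are entirely hypothetical and were chosen for this example; they should not be read as the LabTrove export format, which is described in the LabTrove documentation [4].

```python
import xml.etree.ElementTree as ET

# Hypothetical package layout, invented for illustration only.
EXPORT = """<post>
  <title>Example template</title>
  <author>A. Researcher</author>
  <metadata>
    <key name="material" prompt="Material used in the reaction"/>
    <key name="technique" prompt="Experimental technique"/>
  </metadata>
  <content>table definitions would go here</content>
</post>"""


def read_package(xml_text):
    """Extract packaging information and metadata prompts from the package."""
    root = ET.fromstring(xml_text)
    return {
        "title": root.findtext("title"),
        "author": root.findtext("author"),
        # One entry per metadata key: key name -> prompt shown to the user.
        "prompts": {
            key.get("name"): key.get("prompt")
            for key in root.findall("./metadata/key")
        },
    }


package = read_package(EXPORT)
print(package["title"], "by", package["author"])
print(package["prompts"])
```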

The PNG image provides prospective users of the template with a view of the layout of the post and a preview of the data and metadata that they would be expected to supply. On logging in to myExperiment, one can upload the template to myExperiment from the XML file, together with the preview image (Figure 3).

Figure 3.

(a) The myExperiment upload of the template from Figure 2, showing some of the extracted metadata and the ability to set the sharing controls in the usual manner in myExperiment, and (b) the view of the template from within myExperiment.

MyExperiment receives the XML package from LabTrove and uses the content to fill in many of the essential myExperiment metadata fields. The author can then select the required sharing and collaboration details. The template then appears in myExperiment as shown in Figure 4, which shows the result of a search on myExperiment for LabTrove templates. Other researchers can search myExperiment for workflows of the kind: (LabTrove Template). Using the preview image, they can confirm that the template meets their needs, download it from myExperiment and copy it into their LabTrove instance.

Figure 4.

The result of a search of myExperiment for LabTrove templates.

Earlier, the laboratory record was described as ‘essential support for discussion, sharing, collaboration and formal verification.’ The combination of the LabTrove ELN with the community repository provided by myExperiment facilitates all four of those objectives.

7 CONCLUSION

Scientific advances are already being made through the use of an open, online electronic notebook: resolution methods for the production of a drug for the treatment of schistosomiasis were developed collaboratively through the use of LabTrove [20].

By extending the blog concept to include version control, secure access and a consistent metadata scheme, LabTrove provides an ELN system that combines the best aspects of the traditional journal with the advanced capabilities needed for 21st century science. In particular, the integration with myExperiment enhances the scope for open sharing and exchange of data, methods and other objects of scientific value.

LabTrove already implements a plug-in mechanism for embedding content from external services. Emerging standards for object interchange will increase the potential for tighter integration with tools such as myExperiment, obviating the need for manual export and upload processes.

Progress depends on scientists building on the outputs of other scientists, and the data, methods and procedures available to the wider scientific community will continue to become ever more extensive. Tools such as LabTrove, which not only facilitate the capture of scientific outputs but also enhance their interchange, are vital for progress.

ACKNOWLEDGEMENT

The authors acknowledge funding from JISC and EPSRC (CombeChem GR/R67729, EP/C008863, and e-Research South EP/F05811X). The authors are very grateful for the help of Colin Bird in preparing the manuscript.