Scientific discovery is increasingly driven by the collection, analysis, and comprehension of digital data. Collaborations between domain scientists and computer scientists can accelerate both the process of investigation and the application of its results. The Microsoft eScience Workshop is a recognized venue for showcasing such collaborations and serves as a forum for exchanging both domain and computational research. This editorial provides an overview of the papers that resulted from selected research collaborations presented at the 2010 Microsoft eScience Workshop. Copyright © 2012 John Wiley & Sons, Ltd.


In October 2010, Berkeley welcomed over 200 domain and computer scientists participating in Microsoft's eScience Workshop. The theme of ‘Scaling the Science’ offered opportunities to observe how eScience has enabled scaling across various fields and to explore some of the challenges that remain in realizing the ambitions of the fourth paradigm. The goal of this seventh annual cross-disciplinary workshop was to bring together scientists from diverse research disciplines to share their research and discuss how computing is transforming their work. The current opportunities in the physical and biological sciences and their technological applications require the means to fundamentally understand processes at the molecular scale and to extend those processes to predict performance at larger scales. For example, materials science uses resolution at the atomic scale to predict and design devices that are orders of magnitude larger, and biological processes are dictated by interactions at the molecular, cellular, organismal, population, and ecosystem levels. Spatial and temporal scaling across orders of magnitude requires analysis tools for computation, aggregation, and visualization. eScience is developing approaches for conducting this scaling and has been essential in addressing fundamental questions in biology and astronomy. Although further applications remain in the basic sciences, these fields have demonstrated pathways for advances in the applied environmental and social sciences, where the linkages between scales and disciplines require focused contributions from the eScience community. The workshop provided opportunities to observe how eScience has enabled this scaling across various fields and to explore some of the challenges that remain.

Presenters spoke to the value of and challenges inherent in scientific collaboration and recognized that reuse of data is now essential to scientific progress. This progress also depends on scientists building on the outputs of other scientists, and the data, methods, and procedures available to the wider scientific community will continue to become ever more extensive.

The paper by Bourne [1], a professor in the Department of Pharmacology and Skaggs School of Pharmacy and Pharmaceutical Sciences at the University of California at San Diego, charts a course to deliver more meaningful data. Entitled ‘The Reaming of Life,’ the paper explains that although it may be easy to punch a hole in a piece of metal, a reamer is needed to accurately size and finish that hole. Digital computers are the reamers of life, bringing together a vast array of disparate bits of data to provide an accurate picture of life that can be smoothly traversed across scales, from molecules to populations.

Hunter et al. [2] explain how technologies are enabling citizens to actively participate in ‘citizen science’ projects by contributing data to scientific programs via the Web. Specifically, they describe how online social trust models can provide a simple and effective mechanism for measuring the trustworthiness of community-generated data and how filtering services that remove unreliable or untrusted data enable scientists to confidently reuse citizen science data. The resulting software services are evaluated in the context of the CoralWatch project, a citizen science project that uses volunteers to collect comprehensive data on coral reef health.
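
As a rough illustration of the filtering idea only, the following Python sketch keeps observations whose contributor has an estimated trust score above a threshold. It is not the trust model of Hunter et al.: the smoothed validation rate used as a score, the `Observation` fields, the `history` table, and the 0.6 threshold are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    contributor: str
    site: str
    health_score: float  # hypothetical reef-health reading


def trust_score(validated: int, rejected: int) -> float:
    """Smoothed share of a contributor's past submissions that were validated.

    Illustrative assumption: adding 1 to the numerator and 2 to the
    denominator gives a contributor with no history a neutral score of 0.5.
    """
    return (validated + 1) / (validated + rejected + 2)


def filter_trusted(observations, history, threshold=0.6):
    """Keep observations whose contributor scores at or above the threshold."""
    kept = []
    for obs in observations:
        validated, rejected = history.get(obs.contributor, (0, 0))
        if trust_score(validated, rejected) >= threshold:
            kept.append(obs)
    return kept


# Hypothetical validation history: contributor -> (validated, rejected).
history = {"alice": (9, 1), "bob": (1, 4)}

observations = [
    Observation("alice", "reef-1", 4.0),
    Observation("bob", "reef-1", 1.0),
    Observation("carol", "reef-2", 3.5),  # no history yet
]

trusted = filter_trusted(observations, history)
```

With these made-up numbers only the first contributor's observation is retained. A social trust model in the sense described above would presumably also draw on relationships between community members, which this sketch omits.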

Wolstencroft et al. [3] describe new methods for building a better bridge between semantic data annotation and the laboratory scientist. Using unobtrusive ‘stealthy’ methods for collecting standards-compliant, semantically annotated data, and for contributing ontologies used for those annotations, the authors have built upon the ubiquitous spreadsheet to enable scientists to structure information and select ontology terms without changing common workplace practices.

Frey et al. [4] explore the role of the online electronic notebook LabTrove in the production of a drug for the treatment of schistosomiasis. By extending the blog concept to include version control, secure access, and a consistent metadata scheme, LabTrove provides an Electronic Laboratory Notebook system that combines the best aspects of the traditional journal with the advanced capabilities needed for 21st-century science. In particular, the integration with myExperiment enhances the scope for open sharing and exchange of data, methods, and other objects of scientific value.

The paper by Greenfield [5] describes an approach to analyzing DNA data and quickly answering certain types of ad hoc biological questions. The work is based on simple and fast exact matches of k-mer strings (short strings of DNA) using a database, rather than conventional alignment based on inexact matches of much longer strings. These k-mer tools provide ways of rapidly exploring large genome spaces and handling large volumes of sequence data, and they complement rather than replace existing alignment and assembly tools.
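
The general idea of exact k-mer matching can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tooling described in [5]: an in-memory dictionary stands in for the database, and the sequences, names, and choice of k = 4 are invented for the example.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every overlapping substring of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(references, k):
    """Map each k-mer to the set of reference names that contain it."""
    index = defaultdict(set)
    for name, seq in references.items():
        for kmer in kmers(seq.upper(), k):
            index[kmer].add(name)
    return index


def count_hits(read, index, k):
    """Count, per reference, how many k-mers of the read match exactly."""
    hits = defaultdict(int)
    for kmer in kmers(read.upper(), k):
        # .get avoids inserting empty entries for k-mers not in the index
        for name in index.get(kmer, ()):
            hits[name] += 1
    return dict(hits)


# Toy reference sequences (hypothetical data).
references = {
    "ref_a": "ACGTACGTTAGC",
    "ref_b": "TTTTGGGGCCCC",
}

index = build_index(references, k=4)
hits = count_hits("CGTACGTT", index, k=4)  # {'ref_a': 5}
```

Each lookup is a single exact hash match, which is why this style of query stays fast as the reference collection grows. It says nothing about where or how well a read aligns, which is consistent with the point above that such tools complement alignment and assembly rather than replace them.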

We extend our appreciation to the authors for developing their workshop presentations into the body of work comprising this special issue. These papers represent a small subset of the outstanding material presented at the October 2010 workshop. We encourage interested readers to explore the video presentations, whitepapers, and tutorials available at http://research.microsoft.com/en-us/events/escience2010/agenda.aspx