
The reaming of life: based on the 2010 Jim Gray eScience Award Lecture


Correspondence to: Philip E. Bourne, Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA.

E-mail: pbourne@ucsd.edu


We are well into the era of data-intensive digital scientific discovery, an era defined by Jim Gray as the Fourth Paradigm. From my own perspective of the life sciences, much has been accomplished, but there is much to do if we are to maximize our understanding of biological systems given the data we have today, let alone what is coming. In my 2010 Jim Gray eScience Award Lecture, I gave my own thoughts on what needs to be accomplished, and with an additional year of hindsight, I expand on that here. Copyright © 2012 John Wiley & Sons, Ltd.


The original title of my Jim Gray Award Lecture, delivered in Berkeley, CA, on October 10, 2010 was The Reaming of Life [1]. As a play on the words of Monty Python, the idea was simply to provide the cue cards to prompt a conversation [2], a conversation started by Jim Gray in his vision of the Fourth Paradigm—a vision that is now a reality—the vision of scientific discovery driven by the collection, analysis, and comprehension of digital data by an ever-increasing interdisciplinary community of professional and citizen (garage) scientists. The digital data deluge has punched a rough hole in our thinking and the process for doing science. A reamer is a tool that turns that rough hole into a smooth conduit through which scientific discoveries can be made at an ever-increasing pace in a digital world. My thesis is that we need those reamers now, but we need to overcome technical, social, political, and economic barriers in providing those tools. My thesis is inevitably drawn from my work and my experiences as a computational biologist, but I believe that scientists from domains other than the life sciences will see the generalities in what I have to say, for digital data have no boundaries.


The first reported outbreak of the H1N1 influenza virus occurred between 1918 and 1920; it infected 27% of the world's population at the time and killed at least 3%. It was no surprise then that, in an era of broader contact between people, a second occurrence of an H1N1 strain in 2009 should receive significant attention. Fortunately, the 2009 strain proved to be far less virulent. I would like to make two points that a potential crisis makes apparent. First, unencumbered access to scientific data can prove critical in a time of crisis. I cannot prove that this is true; what I do know is that access to such data does increase significantly when the data are available, as Figure 1 shows. Figure 1 illustrates access to two particular protein structures known to be influenza virus drug targets found in the Research Collaboratory for Structural Bioinformatics (RCSB) Protein Data Bank (PDB), a resource I am proud to be associated with. The PDB is an open and freely accessible archive of the molecular three-dimensional structures of all publicly accessible biological macromolecules, such as proteins, DNA, RNA, and their complexes. Access to these data correlates with the cumulative number of cases of H1N1 reported in the USA. Why is this important? Everyone from K12 students to Nobel laureates had immediate, free, and unbridled access to well-validated and annotated data that were shown to be important in addressing a global health issue—the more reported cases, the more access to the data—presumably to try to solve the crisis. This is the Fourth Paradigm in action as I imagine Jim Gray would have hoped, but there is much still to do if scholars are to be able to respond most effectively in a time of crisis, or at any time for that matter.
Well-described and annotated high-quality data remain the exception rather than the rule, and data producers for the most part, while being increasingly incentivized to make data available, through, for example, government funding mandates for data sharing, still pay too little attention to metadata, provenance, quality metrics, and the like. I will come back to this.

Figure 1.

Increased data access during a time of crisis. At the top right is the accumulated number of H1N1 cases, as reported by the Centers for Disease Control (CDC), during the 2009 outbreak. At the bottom are Google Analytics for access to two structures from the Research Collaboratory for Structural Bioinformatics (RCSB) Protein Data Bank (PDB) that are influenza virus drug targets. There is a correlation between the number of cases and access to these data. This figure is provided by Andreas Prlić.

The second point that the H1N1 pandemic of 2009 illustrated to me is that traditional modes of scholarly communication just do not work in a time of crisis. The process is too slow, does not reach enough stakeholders, and does not provide all that is needed to mount an effective response. The Public Library of Science (PLoS) in collaboration with the National Library of Medicine responded to the H1N1 crisis in a way that may well foretell the future of scholarly communication; they created PLoS Currents: Influenza [3], a simple repository where developments important to understanding the outbreak could be uploaded, quickly moderated, and commented on as they were happening. Such points of data and knowledge aggregation, with some form of community moderation and review, could conceivably become the scientific journals of the future. Before we become too enthusiastic about the prospect, it is important to note that contributions to the site are correlated with the state of the pandemic. After the potential crisis, most contributors have fallen back upon more traditional modes of scholarly communication, which is where the reward lies. Again, I will come back to this.


Science has undoubtedly become more open since Jim Gray first proposed the Fourth Paradigm. This is a good thing, but in my opinion, we still have a very long way to go. A tragic yet compelling example of this comes not from an example impacting the world's population but from an example impacting an individual. I refer you to Josh Summers and his battle with Chordoma disease. I encourage you to listen to his story first hand [4]. In a nutshell, as a young man, he was diagnosed with a fatal disease, but not being one to give up, he volunteered in a laboratory studying the disease and was distressed to find out how little science was shared prior to publication, between the laboratory and the community, and even between members of the same laboratory or close collaborators. That is a troubling bottleneck to the discovery of a treatment when you have only a short time to live! Differing viewpoints on data and knowledge sharing are well documented [5], and I will not belabor them here. My sense is that attitudes are changing, brought about by a connected Web 2.0 world, and future generations of scientists will not only be more used to sharing but will be more compelled to share as science becomes even more collaborative. Let us not wait until then. Elders such as myself are involved in promotion decisions, and I have written (albeit for my own discipline) [6] that we have an obligation to educate decision makers as to the importance of rewarding open science. It is a small step, and one I hope other elders, funders, and indeed all stakeholders in scientific discovery will take.

Perhaps the biggest step towards open science is coming top down from open-access publishing. Although not yet a viable business model for some disciplines, for others, such as the life and physical sciences, it has been shown to work. The PLoS [7], with which I am proud to be associated, is an example of what can be accomplished. It is estimated that by the end of 2012 PLoS will publish 1 in 30 of the journal articles published in the biosciences, and PLoS is only one of a number of important open-access publishers in the biomedical sciences. Couple that with mandates from funders such as the National Institutes of Health and the Howard Hughes Medical Institute in the USA and The Wellcome Trust in the UK to support open access to the research they fund and we have a movement. Not surprisingly, traditional publishers are resisting this change and the loss of revenue it brings. However, if you look at changes in business models that have already occurred in support of newspapers, music, and most recently, books, changes, even total disruption, to science, technology, and medical publishing would seem inevitable. The key to future success would seem to be in leveraging that open content through new business models. I speak of this change from the point of view of business models, because those models provide persistence and longevity, which are prerequisites to science and scholarship. Sensing this change in how content is delivered, communities are emerging to drive that change as exemplified by FORCE11, a group of diverse stakeholders with the intent of facilitating a change in how scholarship is communicated [8].

I described open-access publishing as top down. What I really mean is that it approaches the problem of open science from the end of the research cycle, the final publication. There are signs that it is beginning to influence earlier stages of the research pipeline. Data sharing through established data resources and emergent institutional repositories are signs of change earlier in the discovery process. The emergence of "Facebook for science" resources such as ResearchGate [9] and BiomedExperts [10] is perhaps the beginning of sharing even further down the pipeline towards the synthesis of new ideas. Whatever the future holds, it cannot be decoupled from reward. Reward must be an inherent part of the system, a point I will again come back to.


In an analog world, the distinction between scientific data and the knowledge derived from those data was understandable. The cost and means of data accessibility were prohibitive, and we were satisfied with a hard-copy journal article describing the end product. In a digital world, for no rational reason, this separation remains to the detriment of scientific progress, that is, either the data are not shared, or in disciplines such as my own, bioinformatics, which are founded on shared data, those data remain disconnected from the knowledge derived from those data. As I [11] and others have written [12], it is ironic how much public money is spent retrofitting databases to be more like journals and journals to be more like databases. Databases hire biocurators to extract appropriate information from the literature and add it as annotation to the data contained in the databases, and journals are figuring out ways to provide the data upon which scientific articles are based, or more recently just calling themselves data journals. The reward system for scholarly accomplishment has much to do with the current situation (sorry about being a broken record). Suffice it to say here that as long as the reward is for publishing a paper and not for depositing a dataset, never mind that the data are downloaded much more than the paper is cited, change that recognizes the value of shared data will not happen very rapidly. Publishers not having the expertise, resources, or perhaps the will to effectively manage large data exacerbates the situation. There are promising signs of change, nevertheless. Research funding agencies are insisting on data sharing and to a lesser extent preservation. Institutional repositories also provide options. Partly in response to funders and partly under pressure from scientists themselves, entities such as Dryad [13] and DataCite [14] have emerged as public data repositories and a source of data citation through digital object identifiers, respectively.
Data-oriented journals are also emerging (witness F1000 Research [15] and many others [16]). What are missing, and I feel will come back to haunt us if not addressed broadly, are the metadata standards to properly characterize the increasing number of deposited datasets. Dublin Core [17] is a start but not enough to provide the descriptions needed to make the data effective. Assuming the metadata issue is addressed, also missing are the tools to effectively deposit data, discovery tools to find the data, and metrics to define how the data are used. In a nightmare scenario, data provision could be twice as onerous as publishing. Publishers use a variety of different manuscript processing systems, each distinct, which wastes significant author time, unless the manuscript is accepted on the first try (I wish). But at least there is some uniformity in the final product. A scientific research article follows a standard that mirrors the scientific process itself—introduction, materials and methods, results, discussion—and is easily navigated, which is not true of data repositories. Each has a distinct way of putting the data in and a distinct way of finding the data and taking those data out. In some cases, data repositories have been likened to ‘roach motels’—data check in, but they do not check out [18]. The various communities of data providers within and across domains need to communicate better and be prepared to compromise on approaches to better serve the customer base. Institutional repositories only make sense if they serve a global community, not just the institution. Ask the stakeholders, the faculty, and students at those institutions what they want. For the most part, it is global access to the content they provide.

Figure 2 illustrates a vision from my own domain for how a research article that is integrated with the data might look in the future. Imagine how this would look in your own domain. Elements of this vision do exist today as pointed out at each step. Step 1 (top left) should look familiar; it is the typical HTML-based view of a journal article that one sees today. The distinction is that this is only one of a number of views of the scholarly discourse, not the only one as is true today. The PDF does not count as a different view from the HTML! We need to move ‘Beyond the PDF’ [19].

Figure 2.

One view of the future of knowledge and data integration. PDB, Protein Data Bank; PLoS, Public Library of Science.

Step 2 highlights a particular image in the paper in the way that one enlarges thumbnails today, but there is a difference. The image is a cue to retrieve the data that generated the image via Web services from an integrated data repository. In this case, it happens to be a molecular structure where the data are retrieved from the RCSB PDB. This is actually possible today from the PDB should publishers wish to use this feature. The image can then be interactively manipulated. However, such retrievals do not generate the exact view in the paper. That specific view came as a result of a detailed analysis by the authors. In other words, the image captures a significant amount of knowledge not found in a generic view of the data but rather from significant post-analysis. This knowledge is not captured in any form today, but it would be relatively straightforward to do so for this type of data. A script of commands can be run against the generic view of the molecule generated from the retrieved data to obtain the exact view in the paper. The end result is that, rather than a static image, there is an interactive view of the molecule that the reader can use to gain a further understanding, and perhaps make discoveries of his or her own. In general terms, such a script can be considered metadata for providing a semantically enriched view of the data described by the narrative found in the text. What has been added by the inclusion of the metadata is an element of further discovery not accessible from any journal article or database that I am aware of today. Starting from the same point as the authors, the reader is free to explore further. It is not that this cannot be done today; rather, it cannot be done within the context of the paper, and it cannot be done easily. Like so much of science, this example illustrates that a large percentage of time is spent preparing the data for discovery, rather than on the discovery process itself.
The metadata and the application to render the image were available to the authors but lost to the reader as part of the publication process—a significant inefficiency. As yet, no publisher has engaged to bridge this gap. The need is to retain and recall metadata on demand and have a complement of plug-and-play widgets for rendering specific data types. The Computable Document Format as integrated within Mathematica [20] provides this capability and more, but the widest adoption is going to require access from a Web browser rather than a specific client. HTML5 has the promise of more seamless application integration.
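To make the idea of a view script concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the `View` fields, the three command names (`orient`, `zoom`, `highlight`), the residue numbers, and the placeholder structure ID are invented for illustration and do not correspond to the RCSB PDB services or to any particular molecular viewer. The point is only that the authors' exact view can travel with the article as a few lines of metadata that are replayed against the generic view.

```python
import json
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class View:
    """Hypothetical, much simplified state of a molecular viewer."""
    structure_id: str
    rotation_deg: tuple = (0.0, 0.0, 0.0)  # orientation about x, y, z
    zoom: float = 1.0
    highlighted: frozenset = frozenset()    # residue numbers drawn with emphasis


def apply_script(view, script):
    """Replay a list of view commands and return the resulting view."""
    for command in script:
        op = command["op"]
        if op == "orient":
            view = replace(view, rotation_deg=tuple(command["angles"]))
        elif op == "zoom":
            view = replace(view, zoom=view.zoom * command["factor"])
        elif op == "highlight":
            added = frozenset(command["residues"])
            view = replace(view, highlighted=view.highlighted | added)
        else:
            raise ValueError(f"unknown view command: {op!r}")
    return view


# The script is plain JSON, so it could be stored alongside the figure
# as metadata (assumed format, for illustration only).
FIGURE_SCRIPT = json.dumps([
    {"op": "orient", "angles": [90.0, 0.0, 45.0]},
    {"op": "zoom", "factor": 2.0},
    {"op": "highlight", "residues": [57, 102, 195]},
])

# "XXXX" is a placeholder, not a real PDB entry.
generic_view = View(structure_id="XXXX")
figure_view = apply_script(generic_view, json.loads(FIGURE_SCRIPT))
```

A reader who starts from `figure_view` can keep exploring by applying further commands, for example `apply_script(figure_view, [{"op": "zoom", "factor": 1.5}])`, without losing the authors' starting point.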

Step 2 (Figure 2, bottom right) introduces yet another feature already appearing in a number of Web 2.0 contexts: the ability to annotate the image and to save those annotations so that they can be shared either narrowly or broadly. Today, the success of providing and sharing such science-based annotations is mixed and depends on the medium. Annotating peer-reviewed publications, if provided by the publisher, is scantily used, which might be because of lack of reward and lack of anonymity, if it were not for the fact that a significant commentary on that paper may be found in the blogosphere. Social bookmarking, external ranking, and annotations provided in systems such as Mendeley [21] are all perturbing the traditional system of peer review and letters to the editor. Such perturbation is immediate after publication and is a first step towards what Josh Summers would like to see. The question remains as to how much this will perturb the publication system itself. In some domains, that perturbation happened years ago, witness the impact of arXiv.org on physics, but this remains the exception rather than the rule.

Step 3 (Figure 2) of our data–literature integration scenario depicts the idea of using the ‘journal article’ as a query interface from which new discoveries can be made. In our hypothetical example, a region of the molecule is semantically tagged. With that tag, a combined data and knowledge base is queried and a mash-up of relevant findings is presented. This goes beyond a hyperlink that was generated automatically or manually, or a standard database query. This is a new kind of discovery process where semantic reasoning takes place. We have not yet begun to understand in a broad way how semantic reasoning across a diverse corpus of scientific data and associated knowledge about those data can be done and how findings can be ranked according to the needs of the individual.

Finally, like the research process itself, this hypothetical discovery process is also cyclic. From the mash-up, a further inquiry is made in Step 4 (Figure 2), and a new paper view is revealed, and so the scientific discovery process continues.


What I have described above is one post-publication view of the future. Equally important is new thinking about how we reach that research endpoint. Having the data and the knowledge derived from those data in digital form omits an important component: the methods used to turn data into knowledge and the process that was undertaken to apply those methods. Because the methods are increasingly digital in the form of software, they too can be captured. Putting them all together, what I am describing are workflows. In science in general, workflows still remain somewhat of a novelty. They are frequently used when the scientific process is well defined and not subject to constant change, for example, in the use of Pipeline Pilot by the pharmaceutical industry. They are also adopted by specific disciplines that recognize their value, for example, the use of MyExperiment [22] by bioinformaticians or the use of Galaxy [23] by the high-throughput sequencing community. As the technology improves, and workflow tools move more from research to development, adoption will further increase by virtue of the productivity they bring. Further adoption will require being rewarded for using and sharing workflows.
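As a rough illustration of what it means for a workflow to capture data, methods, and process together, the following Python sketch runs a sequence of named steps and records a provenance entry for each one. It is not modeled on Pipeline Pilot, MyExperiment, or Galaxy; the function names, the step names, and the sample values are all assumptions made for the example.

```python
import hashlib
import json


def fingerprint(value):
    """Short, stable digest of a JSON-serializable value."""
    payload = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def run_workflow(steps, data):
    """Apply each (name, function) step in order and log what happened."""
    provenance = []
    for name, method in steps:
        result = method(data)
        provenance.append({
            "step": name,                   # which method was applied
            "input": fingerprint(data),     # what it was applied to
            "output": fingerprint(result),  # what it produced
        })
        data = result
    return data, provenance


# Hypothetical three-step analysis of a small list of measurements.
STEPS = [
    ("drop_missing", lambda xs: [x for x in xs if x is not None]),
    ("normalize", lambda xs: [x / max(xs) for x in xs]),
    ("mean", lambda xs: sum(xs) / len(xs)),
]

result, provenance = run_workflow(STEPS, [2.0, None, 4.0, 8.0])
```

Because each entry's output digest matches the next entry's input digest, the log can be shared with the result and checked by anyone who reruns the same steps on the same data.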

Workflows capture the process of science and, as a consequence, capture the software methods in a way that is open and persistent. There are other ways to capture software methods that are in common use; repositories such as SourceForge [24] and GitHub [25] are examples. Again, we do not have a reward system for depositing software as part of the scholarly process. Publications about that software are where the reward lies. However, those papers are often not read, and the associated citations of such papers do not necessarily correlate with the actual use of the software.

There seems to me to be another model for scientific software that has yet to be considered—the app model. Suppose I want to apply a method to a well-characterized set of scientific data. Today, I rely on finding it in one of the repositories, or asking colleagues for advice on what software to use, or referring to papers where that software was applied. There is neither a comprehensive view of what software is available for the task nor an unbiased viewpoint as to the value of that software. Moreover, if there are two software applications that purport to perform the same task, it is unlikely that they have user interfaces that are in any way intuitive or similar. These issues do not exist in an app store. Software is easily downloaded and installed, the interface is usually intuitive, and how much the software has been used and how it is rated are easily accessible. It is time we had an app store for science.


It should be apparent from this discussion that scholarship consists of much more than a final publication. But ironically, it is publication alone that defines the scholar—this is an antiquated view of scholarship. Even then, the typical measure of the value of that publication is the impact factor (or perceived quality) of the journal in which it appears. The irony of such a quantitative discipline measuring the value of its members in such a qualitative way should not be missed. The discomfort that some scientists feel both in terms of what is measured as scholarship and how it is measured is leading to some change. Again, the digital medium offers a perfect environment to foster change. Article-Level Metrics [26], which measure the impact of an individual paper in an online journal, is an example of this change. Before Article-Level Metrics, citations could be obtained, but citations do not necessarily tell the whole story of scholarship. From my own work, I can say I have a paper that has been cited over 10,000 times, but hardly anyone has ever read it. The only reason that paper exists is because it is an accepted way to gain credit for the database described in that paper. Why not give credit on the basis of how the database is accessed? Conversely, I have papers that are hardly cited but have been downloaded tens of thousands of times. Many are associated with professional development, but is that not a form of important scholarship that should be credited? In short, the reward system does not adequately address the different forms of scholarship, but in the digital world, we can do better.

Efforts to do better are under way, and Total Impact [27], Microsoft's Academic Search [28], and Google Scholar Citations [29] are examples. To do better requires better quantification, and that means uniquely identifying and quantifying the items of scholarship produced by each scholar. Unambiguous author identification requires a unique identifier. ORCID [30] seems to be a likely candidate at this point in time. More than one system of identification is an inconvenience but not necessarily a show stopper because identifiers can be automatically mapped to each other. Beyond an accepted identifier, items of scholarship must be tagged with that identifier as metadata. In this way, resolvers can identify and gather items of scholarship on demand. A Google Scholar Citation or a Microsoft Academic Search gathers an individual's published scholarship without a unique identifier to disambiguate, but it is hard to imagine that extending to datasets provided by a scholar, reviews of papers performed by a scholar, or blog posts considered meaningful. All are parts of scholarship, yet they are mostly unrewarded at this time. Beginning to assign such rewards could indeed change the perception of scholarship itself. I admit this is a hard sell, particularly at a time when funds to support scholarship are hard to come by. When stressed, the system of scholarship reward falls back on tradition—publish in the most prestigious closed-access journals and worry less about data sharing and the like.
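The tagging and resolving described above can be sketched in a few lines of Python. This is an assumed, toy design rather than how ORCID or any existing service works: the identifier strings (`schemeA:0001` and so on), the alias table, and the item records are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical mapping between two identifier schemes: the key is an
# alternate identifier, the value is the canonical one for the same scholar.
ALIASES = {
    "schemeB:77": "schemeA:0001",
}

# Items of scholarship, each tagged with an author identifier as metadata.
ITEMS = [
    {"kind": "article", "title": "Paper describing a database", "author_id": "schemeA:0001"},
    {"kind": "dataset", "title": "Deposited structure set", "author_id": "schemeB:77"},
    {"kind": "review", "title": "Referee report", "author_id": "schemeA:0001"},
    {"kind": "article", "title": "Unrelated paper", "author_id": "schemeA:0002"},
]


def canonical(author_id, aliases=ALIASES):
    """Map an identifier from any known scheme to its canonical form."""
    return aliases.get(author_id, author_id)


def gather(items, author_id, aliases=ALIASES):
    """Collect titles of all items tagged with this scholar, grouped by kind."""
    target = canonical(author_id, aliases)
    by_kind = defaultdict(list)
    for item in items:
        if canonical(item["author_id"], aliases) == target:
            by_kind[item["kind"]].append(item["title"])
    return dict(by_kind)


portfolio = gather(ITEMS, "schemeB:77")
```

Looking up either identifier returns the same portfolio, which is why a second identification scheme is an inconvenience rather than a show stopper, and the dataset and the review are gathered alongside the article because they carry the same tag.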


I am not qualified to imagine what would be going through Jim's mind at this time, but I can say that we have evolved in our collective thinking in the 5 years since Jim last talked about the Fourth Paradigm. Data preservation and open-access publishing mandates from funders are prime examples of that evolution taking us deeper into the realm of digital data and knowledge as the drivers of science advancement. That is the good news; the bad news is that the issues that Jim outlined for handling this digital deluge are, for the most part, still with us—too much data to be dealt with effectively, data that are poorly described, and inefficiencies in the tools used to analyze these data, to name three show stoppers. So where do we go from here?

DNA sequencing data, as an example, are now doubling about every 5 months. This outpaces the decreasing cost of storage and far outpaces our ability to retrieve these data in a timely way. Reductionism as a well-established scientific principle would seem to be the way forward (the Fifth Paradigm?). We store what we can, but we focus on a subset of the data that we can use to make scientific advancement, that is, a subset that is dynamic (some data go as new data are added) and well described (metadata and ontologies). Without thinking about it as such, this is what we have been doing at the RCSB PDB for some time. We provide representatives of the complete dataset, in our case on the basis of sequence and structure homology, and we annotate, through a feature that we call Molecule of the Month, those structures of most interest to the community. Our complete dataset is relatively small, and reductionism is driven by the ability to annotate rather than the ability to store. In the foreseeable future, it is unlikely that we will need to discard data, but these are decisions some data providers and curators will have to make. Reproducibility (the Fifth Paradigm?) then becomes a key consideration—can I reproduce the data I have to discard and at what cost?
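The 5-month doubling time quoted above is easier to appreciate as a growth factor. The short Python calculation below works it out, and it also shows how the total cost of storing everything would change if the cost per byte were itself halving on a fixed schedule. The 12-month halving time used here is purely an assumed value for illustration; the text gives no figure for storage cost.

```python
def growth_factor(doubling_months, elapsed_months):
    """How many times larger the data are after elapsed_months."""
    return 2.0 ** (elapsed_months / doubling_months)


def relative_storage_cost(elapsed_months, data_doubling_months, cost_halving_months):
    """Cost of storing all the data, relative to today, when the data volume
    doubles every data_doubling_months and the cost per byte halves every
    cost_halving_months."""
    exponent = (elapsed_months / data_doubling_months
                - elapsed_months / cost_halving_months)
    return 2.0 ** exponent


per_year = growth_factor(5, 12)      # 2**2.4, a bit more than fivefold a year
five_years = growth_factor(5, 60)    # 2**12 = 4096-fold in five years

# ASSUMPTION: cost per byte halves every 12 months (illustrative only).
cost_after_five_years = relative_storage_cost(60, 5, 12)  # 2**(12 - 5) = 128
```

Under that assumed halving time, the bill for keeping everything would still grow 128-fold in five years, which is the sense in which data growth outruns cheaper storage and forces a choice about which subset to keep.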

In the biomedical sciences, this would leave us with silos of well-curated, reproducible, and representative data. Silos are the norm now, and this seems destined to continue as dictated by funding models and expertise in particular types of data. The problem is that science advances by discoveries made across silos. For example, the whole field of translational medicine requires integration of data ranging from genotype to phenotype. In fact, the future of biology and medicine research would seem to depend on data integration (the Fifth Paradigm?) across biological scales. The National Library of Medicine's Entrez ‘The Life Sciences Search Engine’ [31] is a stellar example of an integrated data service, but it is only the tip of the data iceberg in terms of coverage and modes of access—thousands of other valuable data resources exist on the Web. The future would seem to be about discovery informatics (the Fifth Paradigm?), and it goes something like this in my laboratory of the future.

At the end of each workday, our laboratory members' electronic laboratory notebooks are scanned for the day's entries, and semantic connections are made that define the most salient interests of the individual and the laboratory as a whole for the past day. A directed and deep search of the Web is done to return, from the literature, data repositories, social media, and other types of Web sites, information ranked by its relevance to our work. Over coffee the next morning, we review the highly relevant findings that help drive our research that day. I think Jim would like this as the progression of the Fourth Paradigm, and for me, it is the ‘Reaming of Life’.