Solving bottlenecks in data sharing in the life sciences
The joint Open PHACTS/GEN2PHEN workshop on “Solving Bottlenecks in Data Sharing in the Life Sciences” was held in Volendam, the Netherlands, on September 19 and 20, 2011, and was attended by representatives from academia, industry, publishing, and funding agencies. The aim of the workshop was to explore the issues that influence the extent to which data in the life sciences are shared, and to explore sustainability scenarios that would enable and promote “open” data sharing. Several key challenges were identified and solutions to each of these were proposed. Hum Mutat 33:1494–1496, 2012. © 2012 Wiley Periodicals, Inc.
“Solving Bottlenecks in Data Sharing in the Life Sciences”, a joint Open PHACTS (http://www.openphacts.org/)/GEN2PHEN (http://www.gen2phen.org/) workshop, was held in Volendam, the Netherlands, on September 19 and 20, 2011. The workshop aimed to explore two directly related topics: data sharing (pushing data outward from its source) and data access (pulling data inward from one or several resources for further use). The goal was to devise a schema to maximally enable and promote “open” data sharing, that is, precompetitive, unencumbered, unrestricted, universally equitable dissemination of datasets generated by academia and (wherever possible also) by industry. Therefore, the focus was restricted to datasets that are ethico-legally “safe” to share, or can be made safe to share by some preprocessing, aggregation, anonymization, or via advanced data access methods designed to protect the data. It was accepted that it would not be possible to share certain other datasets in any kind of “open” fashion.
The meeting attracted 82 participants, signaling considerable interest in the topic. This probably also indicates that a tipping point is being reached regarding the sharing of data as a common and crucial aspect of data-driven science. This is clearly driven by the emergence of “-omics” technologies and other large-scale studies, which demand new research practices and innovative business models and data management structures. These things, however, are typically not the expertise (nor the interest) of scientists, but require specific professional knowledge.
Setting the Scene
To set the scene and to provide a firm grounding for later discussions, the meeting commenced with three short presentations by “data owners” from public and private settings: Evan Bolton (PubChem; http://pubchem.ncbi.nlm.nih.gov/), Frank Schacherer (BIOBASE; http://www.biobase-international.com/), and Christine Chichester (Open PHACTS), each outlining the issues they consider important. This was followed by four keynote presentations:
“Introduction of the IMI Joint Undertaking (http://www.imi.europa.eu/)”, Ann Martin, Principal Scientific Manager, Knowledge Management, Innovative Medicines Initiative Joint Undertaking.
“Open Access and Open Source: no free lunch!”, Jan Velterop, Open PHACTS and AQnowledge (http://aqnowledge.com).
“Open Source licensing and sustainability models for effective data sharing in the Life Sciences”, John Wilbanks, Creative Commons (http://creativecommons.org) and Sage BioNetworks (http://sagebase.org).
“Forms of OPEN Sharing that avoid data disclosure, and methods to make CONTROLLED sharing equivalent to OPEN sharing”, Anthony J. Brookes, GEN2PHEN.
The discussions that followed identified three key elements to be discussed on day two of the meeting, with the expectation that each area would throw up a similar set of challenges that would need to be addressed:
Enabling data sharing (getting data out or “exposed” from their source).
Providing data access (reaching in to get the shared data).
Creating appropriate incentives and sustainability scenarios (comprising the challenges and opportunities that stem from the innate “value” of data, and the maintenance and development of data-sharing platforms).
Challenges and Solutions
The setting for day two was novel: no sitting around in the usual meeting rooms. Instead, the participants were divided into three groups, which set sail from Volendam aboard three traditional Dutch flat-bottomed sailing boats on the Markermeer inland sea. After breakfast on board, discussions were guided by experts in three areas, namely, legal/licensing, social/ego system, and sustainability. Each team was asked to discuss one of these three topics, to identify specific challenges, and to propose solutions. The key conclusions are summarized below.
The discussion of the legal/licensing topic focused on the need to set up a common legal and licensing framework that would allow data sharing to happen with an acceptable level of certainty. Although workshop attendees acknowledged that this would be essential for articulating exploitation and sustainability scenarios around data sharing, it was also agreed that up to now most collaborative initiatives in this field had not handled this issue properly. Licensing and business models around “open data” or “open source” are available, but they are not used widely enough, at least within academia and between academia and industry, and research consortia clearly lack expertise in legal and licensing issues. Following the discussions over the two days, the legal experts advocated keeping things simple by using standard, globally accepted licenses rather than creating ad hoc models for each new endeavor. Proposals ranged from creating machine-readable resolution services for data licenses, to documenting the current procedures and data interoperability guidelines, to developing a regime for managing liability for data providers and data-sharing infrastructures. One option would be to create a marketplace where data owners and consumers could use standard, easily accessible solutions for agreements and licenses.
The consensus was that projects would need specific advice on data licensing and, more precisely, on the option of data “publication” in machine-readable formats under appropriate licenses, so that computational knowledge discovery techniques can be applied efficiently over broad ranges of multi-omics data toward the understanding of complex biological issues.
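As an illustration of what a machine-readable resolution service for data licenses might look like, the following minimal Python sketch maps dataset identifiers to standard licenses and answers a simple reuse question. It is not something presented at the workshop: the dataset identifiers, the registry contents, and the function names are hypothetical, and only the license identifiers and Creative Commons URLs refer to real, published licenses.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class License:
    """Machine-readable summary of a standard license (simplified)."""
    identifier: str
    url: str
    requires_attribution: bool
    allows_commercial_use: bool


# Small registry of standard, globally recognized licenses.
LICENSES = {
    "CC0-1.0": License(
        "CC0-1.0",
        "https://creativecommons.org/publicdomain/zero/1.0/",
        requires_attribution=False,
        allows_commercial_use=True,
    ),
    "CC-BY-4.0": License(
        "CC-BY-4.0",
        "https://creativecommons.org/licenses/by/4.0/",
        requires_attribution=True,
        allows_commercial_use=True,
    ),
    "CC-BY-NC-4.0": License(
        "CC-BY-NC-4.0",
        "https://creativecommons.org/licenses/by-nc/4.0/",
        requires_attribution=True,
        allows_commercial_use=False,
    ),
}

# Hypothetical dataset-to-license assignments (illustrative only).
DATASET_LICENSES = {
    "example:compounds": "CC0-1.0",
    "example:variants": "CC-BY-4.0",
    "example:pathways": "CC-BY-NC-4.0",
}


def resolve_license(dataset_id: str) -> Optional[License]:
    """Return the license for a dataset, or None if none is declared."""
    identifier = DATASET_LICENSES.get(dataset_id)
    return LICENSES.get(identifier) if identifier else None


def usable_commercially(dataset_ids: Iterable[str]) -> bool:
    """True only if every dataset declares a license permitting commercial use."""
    for dataset_id in dataset_ids:
        lic = resolve_license(dataset_id)
        if lic is None or not lic.allows_commercial_use:
            return False
    return True


def attributions_needed(dataset_ids: Iterable[str]) -> list:
    """List the datasets whose licenses require attribution."""
    result = []
    for dataset_id in dataset_ids:
        lic = resolve_license(dataset_id)
        if lic is not None and lic.requires_attribution:
            result.append(dataset_id)
    return result
```

Under these assumptions, a data consumer could check a planned combination of sources before integrating them; a dataset without a declared license is treated conservatively as unusable, which mirrors the uncertainty that the legal experts wanted to remove.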
The slightly strange label “social/ego system” reflects the realization that there can often be a clash of egos when the issues of data sharing are discussed. Everybody agreed that it is good to share data, but then comes the discussion about data-sharing incentives, reward and recognition, and the tricky issue of trust in what others are doing. It is advantageous for the group if everyone shares. However, it is advantageous for individual members of the group not to participate in sharing and to reap the benefits of others sharing. How do we protect against database “piracy” and ensure that the data will not be taken without due recognition, attribution, or citation when reused in a published paper or a Web application? In turn, this leads to considerations of how data sharing might be monitored to ensure proper attribution and citation credit.
The consensus was that, in spite of the very real barriers, open data sharing is generally perceived to be desirable and that the field can only get from the current state to the desired future state by systematically removing these barriers. This could be achieved by actively promoting the advantages of sharing among data owners. There is a need to foster the recognition of sharing by data producers, perhaps by cultivating the notion that sharing should be seen as a form of “publication” in its own right. Enabling this step change requires establishing processes similar to those of paper publication, including a measure of the utility of the data to others: not necessarily an impact factor, but something more like a Google page rank, or “Altmetrics” (http://altmetrics.org/manifesto).
A further dimension to this is that all secondary database usage must declare the provenance of the primary data. This is perhaps best realized by the use of Creative Commons CC-BY attribution, but it is important to recognize that attribution in itself does not equate with citation.
Both Open PHACTS and GEN2PHEN actively pursue the issue of data-sharing incentives.
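To make the idea of a page-rank-like measure of data utility concrete, the sketch below runs a basic PageRank power iteration over a small graph in which each derived resource points to the primary data sources whose provenance it declares. This is only an illustration under stated assumptions, not a metric proposed at the workshop: the resource names are invented, and the damping factor of 0.85 and the fixed iteration count are conventional choices rather than anything taken from the discussions.

```python
def data_rank(reuse_graph, damping=0.85, iterations=50):
    """Score resources by how much declared reuse flows to them.

    reuse_graph maps each resource to the list of sources it declares
    as provenance (an edge means "this resource reuses that source").
    """
    # Collect every resource that appears as a user or as a source.
    nodes = set(reuse_graph)
    for sources in reuse_graph.values():
        nodes.update(sources)
    nodes = sorted(nodes)
    n = len(nodes)

    # Start from a uniform score distribution.
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            sources = reuse_graph.get(node, [])
            if sources:
                # Pass this resource's score on to the sources it credits.
                share = damping * rank[node] / len(sources)
                for source in sources:
                    new_rank[source] += share
            else:
                # A resource crediting nothing spreads its score evenly.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical provenance declarations (illustrative names only).
example_graph = {
    "secondary_db_x": ["primary_a"],
    "secondary_db_y": ["primary_a", "primary_b"],
    "web_app_z": ["secondary_db_x"],
}

scores = data_rank(example_graph)
```

In this toy graph, primary_a scores higher than primary_b because more declared reuse flows to it, both directly and through intermediate resources. Such a measure depends entirely on secondary users declaring provenance, which is why attribution requirements and utility metrics are discussed together here.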
A related matter that is tightly intertwined with any consideration of data sharing and data access is the sustainability of online data resources. In short, how can resources be made sustainable if the primary commodity (the data) is deemed “valuable,” and yet has to be handed on openly (here, meaning “free”)?
These issues were debated in the workshop, but those discussions did not extend to the question of sustaining the data generation activities themselves, or to enabling the radical further development of online resources. It was felt that sustainability challenges cannot be resolved in a “silo-like” manner, and so a coordinated approach to the problem is desired, with exemplar sustainability models being part of the way forward. As part of this, funders need to earmark funds for data stewardship in research grants, but distribute those funds only on evidence of actual delivery on promises (via local or outsourced approaches). Projects themselves need to realize that sustainability efforts cannot be treated as a low priority, nor started so late that they are not yet in effect when the project ends. In addition, specific action is needed to convince governments and industry that the intrinsic value of data should be properly recognized and actually valued. This is a particular issue in IMI projects, where in-kind contributions from the European Federation of Pharmaceutical Industries and Associations (EFPIA) companies are partly based on sharing of previously firewalled data. Data as a form of intellectual currency is a concept to be explored further.
The common denominator in the keynote presentations and in the group discussions was that a tipping point has been reached, and that the “vacuum” created by the change in the way science is done (i.e., in how knowledge is published and shared) needs to be filled with new business models and data-sharing agreements.
The challenges identified and the solutions recommended need extensive additional study with the goal of creating a fully integrated solution. This suggests the need for a dedicated team and not a “Friday afternoon” approach by already overcommitted project team members and coordinators.
Designing and implementing a broad sustainability plan is typically not within the expertise (nor the desire) of scientists and requires specific professional management, legal and business model knowledge.
Data-generating projects need to realize that sustainability efforts cannot be a low priority and that such efforts need to be commenced sufficiently early that they are established well in advance of the end of the project.
Most data-intensive projects have the same sustainability challenges, which should not be resolved in a “silo-like” manner; it makes sense to pool resources and assign this task to a professional team to develop a common solution, implemented for all projects.
Although it is recognized that each project has its own goals and deliverables, the Open PHACTS and GEN2PHEN leaderships, in collaboration with related projects, could take a leading role in organizing a dedicated team of professionals to ensure that steps are taken to establish the necessary principles, guidelines, and mechanisms to ensure that data can be optimally shared for maximal benefit, and that such efforts can be sustained.
Disclosure Statement: The authors declare no conflict of interest.