We must all accept that science is data and that data are science, and thus provide for, and justify the need for the support of, much-improved data curation. (Hanson, Sugden, & Alberts, 2011)
Researchers are producing an unprecedented deluge of data by using new methods and instrumentation. Others may wish to mine these data for new discoveries and innovations. However, research data are not readily available as sharing is common in only a few fields such as astronomy and genomics. Data sharing practices in other fields vary widely. Moreover, research data take many forms, are handled in many ways, using many approaches, and often are difficult to interpret once removed from their initial context. Data sharing is thus a conundrum. Four rationales for sharing data are examined, drawing examples from the sciences, social sciences, and humanities: (1) to reproduce or to verify research, (2) to make results of publicly funded research available to the public, (3) to enable others to ask new questions of extant data, and (4) to advance the state of research and innovation. These rationales differ by the arguments for sharing, by beneficiaries, and by the motivations and incentives of the many stakeholders involved. The challenges are to understand which data might be shared, by whom, with whom, under what conditions, why, and to what effects. Answers will inform data policy and practice.
The data deluge has arrived. It was much anticipated by the science community (Hey & Trefethen, 2003), and the popular press is now heralding the wide availability of data for use by anyone, anywhere. Not only have Nature (2008, 2009) and Science (2011), the premier science journals, published feature sections on “big data,” so have Wired magazine (Anderson, 2008) and The Economist (2010). Universities are assessing their rights, roles, and responsibilities for managing and for exploiting data from their researchers (Association of Research Libraries, 2009; Lyon, 2007).
Grand expectations for the data-rich world include discoveries of new drugs, a better understanding of the earth's climate, and improved ability to examine history and culture. The growth of data in the “big sciences” such as astronomy and physics has led to new models of science—collectively known as the “fourth paradigm”—and to the emergence of new fields of study such as astroinformatics, computational biology, and digital humanities (Borgman, 2009; Hey, Tansley, & Tolle, 2009). The “long tail” of science also is becoming more data-intensive, as new methods and instrumentation enable individual investigators and small teams to gather unprecedented volumes of observations.
If the rewards of the data deluge are to be reaped, then researchers who produce those data must share them, and do so in such a way that the data are interpretable and reusable by others. Underlying this simple statement are thick layers of complexity about the nature of data, research, innovation, and scholarship, incentives and rewards, economics and intellectual property, and public policy. Sharing research data is thus an intricate and difficult problem—in other words, a conundrum.
The “dirty little secret” behind the promotion of data sharing is that not much sharing may be taking place. Despite pressure from funding agencies and findings that sharing research data may increase citation rates (Piwowar, Becich, Bilofsky, & Crowley, 2008; Piwowar & Chapman, 2010; Piwowar, Day, & Fridsma, 2007), relatively few studies document consistent data release. Data sharing activities appear to be concentrated in a few fields, and practices even within these fields are inconsistent (British Library, 2009; Cragin, Palmer, Carlson, & Witt, 2010; Palmer, Cragin, Heidorn, & Smith, 2007; Wynholds, Fearon, Borgman, & Traweek, 2011). In nine years of studying data practices in a National Science Foundation (NSF) Science and Technology Center, we have found that few research data are circulated beyond the research teams that produce them, and few requests are made for these data (Mayernik, 2011; Wallis, Mayernik, Borgman, & Pepe, 2010). The reasons for not sharing data are many. Researchers may lack the expertise, resources, or incentives to share their data. Data often do not exist in transferable forms. Some data are not sharable for ethical or epistemological reasons. In many cases, it is not clear what constitutes “the data” associated with a research project.
This article explores the complexity of data sharing, examining the roots of current discourse, the problematic notion of “data” per se, current policy arguments in favor of data sharing, differing perspectives of stakeholders, and associated ethical, professional, and epistemological aspects of research data. It is a foray into a labyrinth worthy of book-length examination.
Why Is Data Sharing Urgent?
Sharing data is not a new topic of discussion in research and policy circles. Thoughtful reports on the reasons to improve data sharing and curation date at least from the 1980s (Fienberg, Martin, & Straf, 1985), and many more reports followed (National Research Council, 1995, 1997, 2009; National Science Board, 2005; Networking and Information Technology Research and Development, 2009; Berman et al., 2010; Dalrymple, 2003; Esanu & Uhlir, 2003, 2004; Hanson et al., 2011).
“Data sharing” has many meanings in these reports, which are rarely made explicit. For the purposes of this article, data sharing is the release of research data for use by others. Release may take many forms, from private exchange upon request to deposit in a public data collection. Posting datasets on a public website or providing them to a journal as supplementary materials also qualifies as sharing. The degree of usefulness, trustworthiness, and value of shared data varies widely, however. Some may be richly structured and curated. Others may be raw files with minimal documentation. Similarly, the intended users may vary from researchers within a narrow specialty to the general public.
Funding agencies have begun requiring data release to varying degrees, and with varying degrees of enforcement. The National Institutes of Health (NIH) added a data management plan requirement in 2003 for grants over $500,000 (National Science Board, 2005). The NSF has long had this statement requiring data sharing in its grant contracts, but has not enforced the requirement consistently (National Science Foundation, 2001, 2010b):
Investigators are expected to share with other researchers, at no more than incremental cost and within a reasonable time, the primary data, samples, physical collections and other supporting materials created or gathered in the course of work under NSF grants. Grantees are expected to encourage and facilitate such sharing. (National Science Foundation, 2010b, n.p.)
In 2010, the NSF made the long-anticipated announcement that all future grant proposals would require a two-page Data Management Plan that addresses the above-mentioned requirement and that the Plan would be subject to peer review (National Science Foundation, 2010a, 2010b, 2010c, 2011). The NSF requirement is thus more comprehensive than that of the NIH, which applies only to larger grants and is negotiated between investigators and program officers rather than being subject to peer review.
U.K. funding agencies began to formulate data release policies in the 1990s (Wellcome Trust, 1997, 2001, 2003; Economic and Social Research Council, 2010; Lyon, 2007). In response, the Digital Curation Centre (DCC), founded as part of the U.K. eScience initiatives, created a series of templates for data management plans corresponding to the requirements of individual U.K. funding agencies (Digital Curation Centre, 2011). The DCC is adding links to U.S. agency requirements, and the DCC templates are the basis for many of the planning documents now being developed by U.S. universities (Abrams, Cruse, & Kunze, 2009; Witt, Carlson, Brandt, & Cragin, 2009).
Similarly, selected journals long have required the deposit of data and other research documentation associated with published articles. Requirements to deposit genome sequences are best known (Wellcome Trust, 1996, 1997; GenomeCanada, 2005; Berman et al., 2000; Hilgartner, 1998), but journals in economics and many other fields also require access to data. The mechanisms of enforcement may be formal, for example, by requiring deposit in specific collections such as the Protein Data Bank, with the structure entry number included in the article (Protein Data Bank, 2011), or less formal, such as links to sources.
Journal policies have also become more rigorous about data access as of late. Science, in an editorial accompanying a special issue on data, announced more extensive requirements, such as sharing computer code “involved in the creation or analysis of data” and including a “specific statement regarding the availability and curation of data” in the article's acknowledgements (Hanson et al., 2011). Also recently announced are new data archiving policies by “key journals in evolution and ecology” including The American Naturalist, Evolution, Journal of Evolutionary Biology, Molecular Ecology, and Heredity (Dryad, 2011; Whitlock, McPeek, Rausher, Rieseberg, & Moore, 2010, p. 145). These journals are requiring or encouraging the deposit of data in public archives.
In 2010, the Committee on Data for Science and Technology of the International Council for Science (CODATA) established a new task group to study data citation and attribution (Data Citation Standards and Practices, 2010). Among the chief concerns of the task group are best practices in citing data, attributing data sources, and giving credit to data creators, curators, and others who add value to research data (National Academies of Science, 2011). The CODATA task group is surveying scholarly disciplines and stakeholders well beyond the sciences, including universities, publishers, and funding agencies.
Although none of these actions alone caused the sense of urgency for data sharing, the NSF requirement for data management plans appears to be the tipping point, at least in the United States. The NSF has an encyclopedic scope across the sciences and social sciences, excluding only the arts, humanities, and medicine, and even funds grants in these areas for projects that address scientific problems. The NSF requirement applies to all proposals, of any size, in any directorate. Although these are data management plans and not data sharing plans, they do strongly encourage sharing and they are subject to peer review. Thus, an investigator's ability to articulate what her or his data are, how they will be managed, how they will be shared, and, if not shared, then why, will influence whether or not a project is funded. In making these plans part of the peer review process, the NSF has accelerated the conversation about data sharing among stakeholders in publicly funded research.
What is often not explicit in the discussions of data management plans and data sharing requirements are the competing interests and differing incentives of the many stakeholders involved. These elements need to be brought to the foreground of the data sharing conversation.
What Are Data?
A starting point to discuss the conundrum of sharing research data is to examine the complex notion of data. An artifact or observation may be, at best, “alleged evidence,” to use Michael Buckland's pithy phrase (Buckland, 1991). Data may exist only in the eye of the beholder: The recognition that an observation, artifact, or record constitutes data is itself a scholarly act. Data curators, librarians, archivists, and others involved in data management may be offered a collection that is deemed data by the collector, but not perceived as such by the recipients. Conversely, an investigator may be holding collections of materials without realizing how valuable they may be as data.
The concept of data is difficult to define, as data may take many forms, both physical and digital. Among the most widely cited definitions is this one, from a National Academies of Science report: “Data are facts, numbers, letters, and symbols that describe an object, idea, condition, situation, or other factors” (National Research Council, 1999, p. 15). A more current working definition, from an internal Academy document, is particularly useful for discussions of sharing:
The term “data” as used in this document is meant to be broadly inclusive. In addition to digital manifestations of literature (including text, sound, still images, moving images, models, games, or simulations), it refers as well to forms of data and databases that generally require the assistance of computational machinery and software in order to be useful, such as various types of laboratory data including spectrographic, genomic sequencing, and electron microscopy data; observational data, such as remote sensing, geospatial, and socioeconomic data; and other forms of data either generated or compiled, by humans or machines. (Uhlir & Cohen, 2011)
The above-mentioned notion of data transcends the sciences and other domains of scholarship, acknowledging the many forms that data can take. Data sources also vary widely. In the physical and life sciences, most data are gathered or produced by researchers, such as by observations, experiments, or models. In the social sciences, researchers may gather or produce their own data, or they may obtain data from other sources such as public records of economic activity. The notion of data is least well developed in the humanities, although the growth of digital humanities research has led to more common usage of the term. Humanities data most often are drawn from records of human culture, whether archival materials, published documents, or artifacts (Borgman, 2007, 2009).
The term dataset is sometimes conflated with the notion of data. However, definitions of dataset in the scientific literature have at least four common themes—grouping, content, relatedness, and purpose—each of which has multiple categories (Renear, Sacchi, & Wickett, 2010). Although dataset may be useful to refer to a collection of data for the purposes of citation, the term does little to clarify what is meant by data. The difficulty of identifying appropriate units of data to reference is a core problem in establishing practices for data citation and attribution (Borgman, 2011).
Communities and Data
In the requirement for data management plans, the NSF sidesteps the definition of data with the first of its Frequently Asked Questions (National Science Foundation, 2010a):
1. What constitutes “data” covered by a Data Management Plan? What constitutes such data will be determined by the community of interest through the process of peer review and program management. This may include, but is not limited to: data, publications, samples, physical collections, software and models.
The NSF's choice of the term community of interest echoes the practice of the digital archiving world, where policies are framed in terms of the designated community (Consultative Committee for Space Data Systems, 2002). It is left to the investigator—or to the data archive—to designate the appropriate community of interest.
Therein lies the rub. An investigator may be part of multiple, overlapping communities of interest, each of which may have different notions of what are data and different data practices. The boundaries of communities of interest are neither clear nor stable. In the case of data management plans, an investigator is asked to identify the appropriate community for the purposes of a specific grant proposal and for the proposed duration of that award.
Communities of interest, as used by the NSF, appear to be narrower than disciplines or research specialties. Communities of practice (Lave & Wenger, 1991; Wenger, 1998) and epistemic cultures (Knorr-Cetina, 1999) are groupings commonly used in social studies of science. The concept of communities of practice was originated by Lave and Wenger to describe how knowledge is learned and shared in groups, a concept subsequently much studied and extended (Osterlund & Carlile, 2005). Epistemic cultures, in contrast, are neither disciplines nor communities. They are more a set of arrangements and mechanisms associated with the processes of constructing knowledge, and include individuals, groups, artifacts, and technologies (Knorr-Cetina, 1999; Van House, 2004). Common to both communities of practice and epistemic cultures is the idea that knowledge is situated and local. Nancy Van House (2004) summarizes this perspective succinctly: “There is no ‘view from nowhere’—knowledge is always situated in a place, time, conditions, practices, and understandings. There is no single knowledge, but multiple knowledges” (p. 40).
It is the difficulty of bounding either community of interest or data, not to mention bounding the intersection of these two concepts, that makes data sharing requirements so challenging to articulate. Thus, the next sections are devoted to explicating the many forms of research data that might be created and shared, or at least made sharable.
Categories of Data
Some types of data have both immediate and enduring values; some gain value over time, some have transient value, and yet others are easier to recreate than to curate (National Research Council, 1995, 1997, 2009). Many of these distinctions depend on the category of data, as identified in an influential National Science Board (NSB) report (National Science Board, 2005): observational, computational, experimental, and records.
Observational data include weather measurements and attitude surveys, either of which may be associated with specific places and times or may involve multiple places and times (e.g., cross-sectional, longitudinal studies). Computational data result from executing a computer model or simulation, whether for physics or cultural virtual reality. Replicating the model or simulation in the future may require extensive documentation of the hardware, software, and input data. In some cases, only the output of the model might be preserved. Experimental data include results from laboratory studies such as measurements of chemical reactions or from field experiments such as controlled behavioral studies. Whether sufficient data and documentation to reproduce the experiment are kept varies by the cost and reproducibility of the experiment. Records of government, business, and public and private life also yield valuable data for scientific, social scientific, and humanistic research.
The National Science Board's four categories of data are useful as a general framework, but tend to obscure the diversity of types of data that may be collected in any given scholarly endeavor. The physical and life sciences, which are the focus of the long-lived data collections report, exemplify all four categories. The social sciences also collect observations, model social systems, conduct experiments, and assemble records. Humanities scholars do all of the above, but are the least likely to perform experiments.
Investigators collect data for many purposes, using many methods. Research purposes, methods, and approaches all influence what investigators consider to be their “data,” the degree to which those data might be sharable, and the conditions under which researchers are willing to share those data with others. The criteria for identifying data and for sharing them are not yet well understood. Understanding practices, problems, and policies for data is an expanding area of research in the fields of information studies and social studies of science (Borgman, 2007; Bowker, 2000, 2005; Edwards, Mayernik, Batcheller, Bowker, & Borgman, 2011; Karasti, Baker, & Halkola, 2006; Mayernik, 2011; Mayernik, Batcheller, & Borgman, 2011; Palmer, 2005; Renear & Palmer, 2009; Ribes, Baker, Millerand, & Bowker, 2005; Ribes & Finholt, 2007; Wynholds et al., 2011; Zimmerman, 2007).
Purposes for Collecting Data
A brief survey of the purposes for which research data are collected will illustrate some of the complexities that arise in making them available to other potential users. Figure 1 presents three dimensions along which data collection may vary. These dimensions are neither exhaustive nor mutually exclusive. For each dimension, the first pole is the more local and flexible type of purpose, whereas the second pole is more global and systematized. Four scenarios drawn from our research on data practices are used to illustrate these dimensions, along with a variety of other examples. Three of the scenarios are introduced in this section: beach quality, star dust, and an online survey. The fourth scenario, archival records, is introduced in the section on Approaches to Handling Data.
Specificity of purpose
The first dimension illustrated is specificity of purpose, ranging from exploratory research to building observatories. Exploratory investigations pursue specific questions, often at a specific site, usually about a specific phenomenon, and may take place in a laboratory, a field setting, or some combination thereof.
Studies to identify sources of bacteria and other beach contaminants offer examples of exploratory research. In the beach quality scenario, one or two students collect water samples, selected for time of day, location, weather conditions (e.g., dry or rainy), and other factors. Using a small portable wet lab, they dilute the samples to standard pH levels. The dilution varies by the expected concentration of bacteria, a judgment that requires scientific expertise specific to this type of research.
Once samples are collected, diluted, and brought to the campus lab, one of two processing methods is applied. The simpler and less expensive method is to culture the samples for 24 hours, then to count the bacteria. This method is slow and too insensitive to distinguish between human and animal sources of bacteria. The more sophisticated method is quantitative polymerase chain reaction (qPCR), adapted from medical applications, which requires greater expertise and is much more expensive. This method is faster and more sensitive, but results will vary between laboratories due to choices of local protocols, filter material, machine type and model, and handling methods. Protocols and results are shared between partner laboratories seeking to perfect the method, but little other than the methods of data collection, protocols, and final curves might be reported in the journal articles. Biological samples are fragile; they degrade quickly or are destroyed in the analysis process.
At the other end of the specificity dimension are observatories, which are institutions for the observation and interpretation of natural phenomena. Examples include NEON and LTER in ecology (National Ecological Observatory Network, 2010; U.S. Long Term Ecological Research Network, 2010; Porter, 2010), GEON in the earth sciences (GEON, 2011; Ribes & Bowker, 2008), and synoptic sky surveys in astronomy (Panoramic Survey Telescope & Rapid Response System, 2009; Large Synoptic Survey Telescope, 2010; Sloan Digital Sky Survey, 2010). Observatories attempt to provide a comprehensive view of some whole entity or system, such as the earth or sky. Global climate modeling, for example, depends upon consistent data collection of climate phenomena around the world at agreed-upon times, locations, and variables (Edwards, 2010).
The value of observatories lies in systematically capturing the same set of observations over long periods of time. Astronomical observatories are massive investments, intended to serve a large community. Investigators and others can mine the data to ask their own questions or to identify bases for comparison with data from other sources. Studies of the role of dust emission in star formation make use of observatory data. In this star dust scenario, a team of astrophysics researchers queries several data collections that hold observations at different wavelengths, extracting many years of observations taken in a specific star-forming region of interest. They apply several new methods of data analysis to model physical processes in star formation. By combining data from multiple observatories, they produce empirical results that enable them to propose a new theory. Typically the combined dataset is released when they publish the journal article describing their results.
Scope of data collection
The second dimension of Figure 1 is the scope of data collection. At one pole are studies that describe particular events or phenomena; at the other are studies that model entire systems. The beach quality scenario above is descriptive in nature, while the star dust scenario is closer to the model end of the dimension, as the goal of that research is to model physical processes. Climate research spans this spectrum. Weather data, in the short term, can be used to describe or predict rain, snow, wind, or other events. Systematic climate observations are used as inputs to models of physical processes to study the earth's climate. The same observations may be input to multiple models, each developed by different research teams, and each with its own set of parameters and theories (Edwards, 2010).
Survey research spans these dimensions, as individual studies may be more or less specific and may vary considerably in scope. In the online survey scenario, a single investigator constructs a survey of student attitudes that is deployed to several hundred universities. Large-scale surveys are suitable for hypothesis testing and description of populations. Interviews are better for exploratory studies, and also can be used to develop theories. Conversely, because large online surveys are highly structured and usually anonymized, the resulting data are easier to share. Interviews are more open ended, personalized, harder to anonymize, and difficult to code in consistent formats.
Among the great promises of data sharing is the ability to aggregate and compare multiple local studies. However, such integration of data is not always possible or desirable. Many research domains are concerned with rich descriptions of complex phenomena that are associated with specific times and places. Marine biologists, for example, study local phenomena such as harmful algal blooms. They may collect data for months or even years to capture conditions before, during, and after an event. Their goal is to understand the processes that trigger an event and how those processes evolve (Borgman, Wallis, Mayernik, & Pepe, 2007; Gobler, Boneillo, Debenham, & Caron, 2004). Small studies may cumulate into larger endeavors; in that case, the data from each individual study may become more valuable as the data cumulate, enabling comparisons across time periods and locations. In other cases, small studies may be one-off investigations of individual phenomena at a particular time and place (Borgman, Wallis, & Enyedy, 2006; Bowker, 2000; Karasti et al., 2006; Karasti, Baker, & Millerand, 2010).
It has proven difficult to aggregate studies of biological events into comprehensive systems models of the type used in climate research, due largely to differences in data characteristics (Aronova, Baker, & Oreskes, 2010). The ecological sciences community is promoting data sharing by creating standards for data documentation and by standardizing collection practices, as their research tends to focus on local phenomena (Moore, McPeek, Rausher, Rieseberg, & Whitlock, 2010; Whitlock, 2011). Data in most of the examples above would be considered observations, per the NSB categorization. Some might be considered natural experiments, such as comparisons of phenomena that occur in similar environments. Some of these observations could be aggregated into models of phenomena, while others could not. The beach quality scenario reflects the local characteristics of much data collection. No matter how well the researchers document their field practices, gathering an identical set of water samples is impossible due to changing environmental conditions.
Goal of research
The third dimension in Figure 1 is the goal of the research, ranging from empirical to theoretical. Most research in the sciences and social sciences relies on empirical data of some sort. Scholars in the humanities often collect data, such as assembling records and notes from archives. Theoreticians, who may or may not be the same people as those who collect empirical data, draw upon various forms of evidence to propose and to test theories.
In empirical investigations, some variables are controlled and others are tested. To study beach quality, for example, researchers can control variables by diluting samples to standard pH values and by following consistent protocols. They can compare samples from different sites under different conditions. Their experiments can be replicated by gathering new samples, but they cannot reproduce results, as any given sample can be analyzed only once.
Those studying the atmosphere or the universe have little control over their variables. However, they can conduct experiments on theoretical models. In theoretical modeling—whether climate, astronomy, or the economy—data may be simulated, rather than collected from the “real world.” Observations of the physical universe occur at a unique place and time and can never be reconstructed, whereas experiments and models can be recreated (National Research Council, 1995). Important differences in terminology arise between experimentalists and theorists, however. Observations of the real world are data to experimentalists, while theoreticians often consider the simulated observations that are output from models to be their data (Edwards, 2010; Wynholds et al., 2011).
Approaches to Handling Data
Data collection, processing, management, and interpretation can be approached in many ways. Individuals and communities apply many combinations of techniques to identify, capture, describe, analyze, derive, and make sense of their data. As with the dimensions of purposes outlined above, this selection of approaches is intended to be illustrative rather than exhaustive or mutually exclusive (Figure 2). The approaches also range from the more local and flexible at the first pole to the more global and systematized at the second pole of each dimension.
The first dimension of approaches to data collection is the number of people involved. Individuals working alone have complete control over their methods and their data. Teams, who may be widely distributed, have to agree upon what data will be collected, by what techniques and instruments, and who has the rights and responsibilities to analyze, publish, and release those data (Borgman, Bowker, Finholt, & Wallis, 2009; David, 2004; Olson, Zimmerman, & Bos, 2008; Ribes & Finholt, 2007). The historian works alone with archival records. The sociologist conducting the online survey of student attitudes may work alone or with a small team of students and statisticians. The beach quality research is conducted by 5 to 10 graduate and undergraduate students, and led by a single investigator. In contrast, dozens, if not hundreds or even thousands, of people around the world may be involved in collecting and curating data in observatories. Those who draw upon the data from those observatories may be individuals or teams of any size.
Labor to collect data
Approaches to data collection also differ in the amount of human labor required—the second dimension in Figure 2. Investigators in beach quality, marine biology, or other field research may spend days, weeks, or months hand-gathering physical samples of soil, water, or plants, which then must be processed in a laboratory to extract data—a process that also may require days, weeks, or months. Similarly, the historian may spend months or years in historical archives, taking notes on a laptop, or only with pencil on paper, by the rules of some archives. In this archival records scenario, the scholar may devote months or years to extracting useful data from those notes. These labor-intensive approaches have the advantage of flexibility and local control by the investigators. They have the disadvantages, from a data sharing perspective, of being difficult to replicate and of producing data that are not consistent in form or structure.
Machine-collected observations, whether by telescopes, sensor networks, online survey software, or social network logs, may be labor-intensive to design and develop, but once deployed can produce massive amounts of data that can be used by many people. Major telescopes, both on land and in space, for example, require long-term collaborations among scientists and technologists. Data structures, management, and curation plans are developed in parallel with the design of studies and instruments, a process of a decade or more. Machine-collected data tend to be consistent and structured, and to scale well, but considerable expertise is required to interpret them. Conversely, these forms of data collection are less flexible and adaptable to an individual investigator's research questions. Observation parameters may be hardwired or coded into the technologies, thus determining the data that can be obtained.
Standards for data structures, metadata, and ontologies depend on consistent data collection. The European Union, for example, is developing Project EMODNET “to assemble fragmented and inaccessible marine data into interoperable, continuous and publicly available data streams” for maritime operations and research (European Marine Observation and Data Network, 2011). EMODNET will promulgate data standards and infrastructure in multiple marine and maritime observatory systems (SeaDataNet, 2009).
Another consideration is whose labor is involved. In most cases, investigators and their students and staff collect their own data. In some forms of citizen science, data are collected by volunteers with varying degrees of training or expertise, whether in bird spotting (eBird, 2009) or identifying invasive species (What's Invasive!, 2011).
Labor to process data
A third dimension of approaches to data collection is the amount and type of processing required for interpretation. At one pole of this dimension in Figure 2 are hand-processing techniques. These techniques may be simple, such as measuring the dimensions of leaves and assigning geospatial and temporal parameters (e.g., precise location and time where gathered) and experimental parameters (e.g., amount of sunlight and shade, proximity to the ground or water). In other cases of collecting physical samples, the actual data may be instrument readings (e.g., a type of nitrate as indicated by a voltage measurement on a sensor, or concentration of a bacterium in parts per million of water). Whether the numbers are handwritten in a field notebook or machine generated, they must be associated with a specific sample. Other information, such as the type of machine, its calibration, the time, date, and place of data collection, and the method by which the sample was captured, is necessary to interpret any given data point (Borgman, Wallis, & Enyedy, 2007). Similarly, the historian in an archive is documenting characteristics of the records examined so that those notes can be interpreted later.
In the most highly instrumented research, such as astronomical sky surveys, instruments capture contextual information about the data. Although minimal human labor may be required for processing the data, considerable expertise is required to assess the accuracy of data and metadata in these research environments, as minute errors in calibration can influence analysis and interpretation significantly (Mayernik et al., 2011; Wynholds et al., 2011). Statistical analysis can be applied both to online surveys and to the star dust data drawn from observatories. Human processing labor may be minimal, but the domain expertise required for analysis is high, of course. The beach quality studies fall in the middle of the dimension, as samples gathered by hand may be processed by sophisticated technologies such as qPCR to yield numerical results.
Among the tradeoffs in citizen science is the amount of expert postprocessing required. The public can generate data of great research value, as well as spurious observations and interpretations that can undermine research efforts. Citizen science projects often devote considerable effort to data verification and validation (Cornell Lab of Ornithology, 2009; Galaxy Zoo, 2011).
Generally speaking, the more handcrafted the data collection and the more labor-intensive the postprocessing for interpretation, the less likely that researchers will share their data. However, data types and practices vary so widely across fields and research teams that any such generalizations are difficult to make (Hilgartner & Brandt-Rauf, 1994; Pritchard, Carver, & Anand, 2004).
Why Share Research Data?
As is evident from the above discussion of the purposes and approaches to handling data, investigators (and their collaborators, students, and staff) devote massive amounts of physical and intellectual labor to collecting, managing, and analyzing their data and to publishing their results. Data are the lifeblood of research in any field, but just what those data are varies by purpose, approach, instrumentation, community, and many other local and global considerations. Some of those data may be in sharable forms, others not. Some data are of recognized value to the community, others not. Some researchers wish to share all of their data all of the time, some wish never to share any of their data, and most are willing to share some of their data some of the time. These competing perspectives, the array of data types and origins, and the variety of local circumstances all contribute to the intricacy and difficulty of sharing data.
The pressure to share data comes from many quarters: funding agencies (both public and private), policy bodies such as national academies and research councils, journal publishers, educators, the public at large, and researchers themselves. These stakeholders each have their own reasons for requiring or encouraging data sharing. An examination of the public statements of these entities shows that some identify explicit benefits of data sharing to specific parties, while others are vague about why data should be shared and whose interests will be served. Rationales, arguments, motivations, incentives, and benefits often are conflated. For the data-sharing conundrum to be addressed effectively, these oft-subtle distinctions need to be brought to the fore.
The model presented below is framed in terms of rationales for sharing data. A rationale is an explanation of the controlling principles of opinion, belief, or practice. An argument, in contrast, is intended to persuade; it is the set of reasons given for an individual or an agency to take action. Underlying these rationales are motivations and incentives, whether stated explicitly or left implicit. A motivation is something that causes someone to act, whereas an incentive is an external influence that incites someone to act. Rationales for sharing data also include beneficiaries, whether stated or implicit. A beneficiary in this case is an individual, agency, community, or other stakeholder who receives a benefit from the act of sharing data, such as the use of those data for a particular purpose (Merriam-Webster's Collegiate Dictionary, 2005).
Four rationales are presented in Figure 3, positioned on two axes. The sources for the model are the policy documents and studies of data sharing cited herein, and the author's participation in public discourse on these issues. The four rationales are to (a) reproduce or verify research, (b) make results of publicly funded research available to the public, (c) enable others to ask new questions of extant data, and (d) advance the state of research and innovation. The dimensions on which these rationales are positioned are arguments for sharing and beneficiaries of sharing. The model is not exhaustive either in terms of rationales or dimensions, but is offered as a useful framework for examining the complex interactions of players, policies, and practices involved in sharing research data.
The arguments dimension (vertical axis) positions the rationales by their emphasis on the needs of the research community or the needs of the public at large. Researchers, funding agencies, and journals often make different arguments for the value of sharing data. Motivations of the many stakeholders may be aligned, but often they are in conflict.
The beneficiaries dimension (horizontal axis) positions the rationales by their emphasis on benefits to researchers who produce the data or benefits to those who might use research data. Here also, motivations of stakeholders may be aligned, but often they are in conflict. Funding agencies are responsible to their research communities and to the public. Journals must serve their readers, their authors, and their publishers. Researchers’ incentives to release their own data may or may not align with their motivations to gain access to the data of others. Similarly, funding agencies’ and journals’ motivations for data release may conflict with the incentives of the researchers who create those data.
Neither dimension is absolute; the poles represent relative positions of people or situations. For example, a researcher or policy maker may make one argument on behalf of the producers of data and another on behalf of the users. Similarly, an argument made in the name of scholarship may also serve the public good. These arguments and beneficiaries are not mutually exclusive; rather, they provide a two-dimensional space in which to place the various rationales in favor of sharing research data.
Subtle distinctions in the rationales for data sharing may lead to markedly different policies, economic models, research practices, curation practices, and degrees of compliance. Of particular concern is how those rationales align with the incentives of those whose work produces the data. Accordingly, discussion of the four rationales focuses most heavily on the concerns of data producers and on their abilities, motivations, and incentives to share their data.
The model proposed here is intended to provoke discussion among the many stakeholders in research data. Most of the examples are drawn from the sciences and social sciences, as these are the areas most studied and are on the front lines of current policy debates. This analysis can be extrapolated to the humanities, where similar policies for data sharing are under discussion (Kansa, Kansa, Burton, & Stankowski, 2010; Unsworth et al., 2006). The ability to implement any data sharing policy will depend on many factors, including local data practices, differences in the intellectual property rights intrinsic to data sources, and the need to maintain confidentiality of human subjects (Borgman, 2007).
To Reproduce or to Verify Research
Reproducibility or replication of research is viewed as “the gold standard” for science (Jasny, Chin, Chong, & Vignieri, 2011), yet it is the most problematic rationale for sharing research data. This rationale is fundamentally research driven but can also be viewed as serving the public good. Reproducing a study confirms the science, and in doing so confirms that public monies were well spent. However, the argument can be applied only to certain kinds of data and types of research, and rests upon several questionable assumptions.
Pressure is mounting to share data for the purposes of reproducing research findings. A recent special issue of Science on replication and reproducibility examines the approaches, benefits, and challenges across multiple fields (Ioannidis & Khoury, 2011; Jasny et al., 2011; Peng, 2011; Ryan, 2011; Santer, Wigley, & Taylor, 2011; Tomasello & Call, 2011). The authors encourage data sharing to increase the likelihood of replication, while acknowledging the very different methods and standards for reproducibility in each field discussed. Particularly challenging are the “omics” fields (e.g., genomics, transcriptomics, proteomics, metabolomics), in which “clinically meaningful discoveries are hidden within millions of analyses” (Ioannidis & Khoury, 2011, p. 1230). Fine distinctions are made between reproducibility, validation, utility, replication, and repeatability, each of which has distinct meaning in individual omics fields.
The Wall Street Journal, in a front-page article reviewing this special issue of Science, oversimplified the concern by stating that “reproducibility is the foundation of all modern research, the standard by which scientific claims are evaluated” (Naik, 2011). Naik's article provides multiple examples of biomedical companies spending tens or hundreds of millions of dollars to reproduce research reported in journal articles, often without success. In this view, lack of reproducibility is a flaw in science, often traceable to problems such as insufficient control groups or to the difficulty of publishing negative results. However, by the 1970s, research in the sociology of science was describing the difficulties of verifying scientific results. Harry Collins studied the role of replication in multiple scientific disputes, providing numerous examples of disagreements about validation methods (Collins, 1983). In extensive studies of gravitational waves in physics, for example, he found that some scientists believed that only the experiments that detected these waves were performed appropriately, whereas other scientists trusted only the experiments that failed to detect such waves (Collins, 1975, 1998).
Peer review rests on expert judgment rather than on reproducibility. Reviewers or referees are expected to assess the reliability and validity of a research report based on the information provided. In only a few fields do reviewers attempt to reanalyze or verify data or to reconstruct all the steps in a mathematical proof or other procedure. Even when data are included with a journal article or conference paper, rarely is enough information provided to reproduce the results. Instrument details and calibration may be omitted, or lab-specific practices may not be documented in sufficient detail. This is normal practice, both because journal space constraints discourage elaborate methods sections and because research expertise relies upon tacit knowledge that is not easily documented (Bowker, 2005; Collins, 1983; Kanfer et al., 2000; Latour, 1987; Latour & Woolgar, 1979, 1986). Yet whenever published articles are withdrawn from major journals, questions are raised about what the reviewers knew—or should have known—about the data and procedures (Brumfiel, 2002; Couzin & Unger, 2006; Couzin-Frankel, 2010; Normile, Vogel, & Couzin, 2006).
Reproducibility is a high bar, and even it has several levels, such as the precise duplication of observations or experiments, exact replication of a software workflow, degree of effort necessary, and whether proprietary tools are required (Reproducible research, 2010; Stodden, 2009a, 2009b; Vandewalle, Kovacevic, & Vetterli, 2009). Observations of the real world—such as water samples, algae, air and soil temperature, or comets—are associated with times and places, and thus rarely can they be reproduced. It is for this reason that observations are worth curating. Experimental observations may be reproduced in laboratory conditions, given sufficient documentation and access to the same materials, equipment, software, and technical expertise. Given the same set of observations, whether natural or experimental, other researchers may be able to replicate the results, if identical circumstances can be achieved. However, experienced researchers know that minute differences in procedures, machine calibrations, temperature and humidity near the instruments, and other factors can introduce undetected variation that influences replicability and interpretation.
Even the research activities are difficult to replicate, because processing and analysis tools such as statistical, mathematical, and workflow software rarely maintain a precise record of the systems or the data at each transaction (Claerbout, 2010; Goble & De Roure, 2009). Questions of provenance, in both the archival sense of “chain of custody” and the computing sense of “transformations from original state,” arise in interpreting data that may have evolved through multiple hands and processes (Buneman, Khanna, & Tan, 2000; Gil, 2010; Hunter & Cheung, 2007).
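To make the two senses of provenance concrete, the following is a minimal illustrative sketch in Python, not a description of any tool cited here. It assumes a dataset small enough to hold in memory as a list of records; the function names (`fingerprint`, `apply_step`), the field names, and the sample readings are all hypothetical. Each processing step appends a log entry recording who acted (chain of custody) and a hash of the data before and after the step (transformations from original state).

```python
import hashlib
import json


def fingerprint(records):
    """Return a short, stable hash of a list of JSON-serializable records."""
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def apply_step(records, log, name, func, agent):
    """Apply one transformation and append a provenance entry to the log."""
    result = func(records)
    log.append({
        "step": name,                    # what was done
        "agent": agent,                  # who did it (chain of custody)
        "input": fingerprint(records),   # state of the data before the step
        "output": fingerprint(result),   # state of the data after the step
    })
    return result


# Hypothetical field readings; None marks a failed sensor reading.
raw = [{"site": "A", "nitrate_mv": 412}, {"site": "B", "nitrate_mv": None}]
log = []

cleaned = apply_step(
    raw, log, "drop missing readings",
    lambda rs: [r for r in rs if r["nitrate_mv"] is not None],
    agent="field team",
)
scaled = apply_step(
    cleaned, log, "convert mV to V",
    lambda rs: [{**r, "nitrate_v": r["nitrate_mv"] / 1000} for r in rs],
    agent="lab analyst",
)
```

In such a log, the output hash of one step should match the input hash of the next; a mismatch would signal an undocumented change between hands. Even a complete log of this kind records only that a judgment was applied, not why, which is the tacit knowledge discussed above.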
Other impediments to reproducibility include difficulties in gaining access to data and to tools used to create and analyze those data, lack of a licensing regime to provide access to proprietary software and to data (Stodden, 2009b), and conflicts in copyright law, such as differences between the United States and Europe in the ability to make proprietary claims on factual matters (Reichman & Uhlir, 2003).
Reproducibility is elusive because of the vagaries of the research process, the varying notions of data, disparate community practices for documenting evidence, the transitory nature of observations, and the combinations of expertise necessary for interpretation of evidence. As a rationale for sharing research data, reproducibility is problematic not only because it applies to so few types of research, but because it risks reducing the research process to a set of mechanistic procedures. Even chemists will acknowledge that their work is as much art as it is science (Lagoze & Velden, 2009a, 2009b). True reproducibility requires deep engagement with the epistemological questions of a given research specialty, and the very different ways in which investigators obtain and value evidence. Reusers of data may not know, or be able to know, what prior actors did to the data. Each step in cleaning or processing data requires judgments, few of which may be fully documented. Later interpretations thus may depend upon multilevel inferences that are statistically problematic (Meng, 2010).
Although reproducibility is a popular term in promoting data release, the concept incorporates a wide array of interpretations, including replication, repeatability, validation, and verification. While a less precise notion, verification may be easier to accomplish. Peer reviewers and readers can judge whether the research methods meet community standards, whether the evidence is appropriate for the claim, and whether the arguments are reasonable, even if they are unable to reconstruct all of the procedures. Potential reusers of data also can apply their own judgments about the veracity and usefulness of the data; their criteria may be highly specific to their research topic (Faniel & Jacobsen, 2010; Wynholds et al., 2011).
Motivations to share research data for the purposes of reproducibility or verification vary widely by type and condition of the data and expectations of the community. Where data deposit is required as a condition of publication, as in the case of genomes and the Protein Data Bank discussed earlier, researchers will comply. If reproducibility requires materials that are very difficult to share, such as specialized animals or cell lines, then community practices will dictate the conditions of data release. Considerable human expertise, labor, and licensing of intellectual property may be involved. Researchers give these reasons and many others for refusing to release data (Campbell et al., 2002; Hilgartner, 1997, 1998, 2002; Hilgartner & Brandt-Rauf, 1994). If materials and documentation are highly automated, if no licensing restrictions apply, and if the researcher has completed his or her publication of the results, then data sharing is more likely to occur.
However, researchers may never be “done” with their data. In cases where a research career is based on long-term study of a specific species, locale, or set of artifacts, data become more valuable as they cumulate. Researchers in these situations may be particularly reluctant to release data associated with a specific publication, as it might mean releasing many years of data. Similarly, reproducing the data associated with any given publication is problematic, as the set of observations reported may depend heavily on prior studies and on interpretation of much earlier data.
Scholars may wish to verify findings either to build upon or to refute them. The generalized rationale of sharing data for reproducibility or for verification of results lies near the research-driven pole of the argument dimension in Figure 3. It spans the interests of data producers, who can benefit by having their findings verified by others to reinforce their veracity, and the interests of researchers who would use others’ data. The greater possibility of replication or verification occurs with data produced for the purposes of observatories, modeling systems, or for theory building (the far end of the dimensions in Figure 1), or those collected by large teams, by technologies, and subject to machine processing (the far poles of Figure 2). These categories of data are more likely to be captured and described consistently than are data collected for exploratory investigations, to describe phenomena, for empirical studies, by individual investigators, or processed by hand. Innovation in research requires new methods of research design, analysis, and technology.
From an epistemological perspective, reproducibility and verification are the most problematic of the four arguments for sharing data. Often the research creativity lies in identifying a new method required to approach an old problem. Research outcomes often depend much more on interpretation than on the data per se. Separating data from context is a risky matter that must be balanced carefully against demands for reproducibility.
To Make Results of Publicly Funded Research Available to the Public
Public sentiment for sharing research data is based largely on the rationale that tax monies should be leveraged to serve the public good. In this view, data produced with public funds should be available for use and should not be hoarded by researchers. The public good argument is implicit in the OECD principles, to which several of the funding agency policies refer, namely, that open access to research data is a means to leverage public investment in research (Organisation for Economic Co-operation and Development, 2007). The OECD document also builds upon an earlier U.S. study, explicitly quoting this passage: “The value of data lies in their use. Full and open access to scientific data should be adopted as the international norm for the exchange of scientific data derived from publicly funded research” (National Research Council, 1997).
U.S. public policy tends toward openness of research information. Federal law waives copyright protection on data and information directly produced by government agencies, putting those materials into the public domain in the United States (Reichman & Uhlir, 2003). Data and information resulting from research grants to universities and other agencies do not fall under the same law. Data and information produced by European governments generally do not fall into the public domain, and some of their database laws are more restrictive than those of the United States (Boyle, 2004; Boyle & Jenkins, 2003). However, initiatives such as EMODNET and INSPIRE in the environmental sciences are building infrastructure to combine and mine data throughout the European Union (INSPIRE, 2007; European Marine Observation and Data Network, 2011).
The rationale of making publicly funded research available to the public applies both to data and to publications, but is playing out differently between the two. The “public monies for public goods” rationale has succeeded in the biomedical research community for the open deposit of publications, but not without resistance, especially on the part of publishers. Biomedical information has a substantial audience, including biomedical researchers, clinicians, the pharmaceutical industry, and patients, so it is not surprising that this was the first frontier for open access. Publications resulting from NIH funding must be deposited into PubMed Central within 12 months of publication, an embargo period that protects the journals (National Institutes of Health, 2005). However, with regard to data, NIH requires the release only for grants over a certain size, and allows these data to be embargoed for a certain period of time to enable the investigators to publish their findings. The NIH does not require that the data be deposited in any particular resource, only that they be released.
The Wellcome Trust, the largest funder of biomedical research in the United Kingdom, requires both publications and data from their grants to be made available. They support multiple types of open access publication, but do not follow the NIH model of requiring deposit in a specific repository (Wellcome Trust, 2005; Fazackerley, 2004).
The NSF policy is ambiguous with respect to what must be released. Publications are defined as a type of data within the scope of the Data Management Plan, but NSF does not specifically require that publications resulting from their grants be released openly (National Science Foundation, 2010a). In light of growing pressure for open access to publications, the situation may change (Directory of Open Access Journals, 2009; Open Content Alliance, 2009; Beaudouin-Lafon, 2010; Crow, 2009; Kaiser, 2008; Ware, 2010; Young, 2009). One result of the popularity of the NIH publication deposit requirements is a bill introduced into Congress to require federal research granting agencies to make resulting publications available to the public (Federal Research Public Access Act of 2012). In late 2011, the U.S. government solicited public input on policies for access to research data (Office of the Federal Register, 2011).
The release of data and the release of publications are coupled both as an economic argument and as support for reproducibility, verification, and reuse. Publications are the primary form of documentation for most types of data. They explain the research problem addressed, the methods by which the data were collected, the analyses performed, and the interpretation of the results. Publications add value to data and vice versa (Borgman, 2007; Bourne, 2005; Pepe, Mayernik, Borgman, & Van de Sompel, 2010).
The Economic and Social Research Council, which is the primary U.K. funding agency for these areas, recently issued a data policy that goes well beyond the scope of the NSF, NIH, and Wellcome Trust requirements. The ESRC requires not only the public release of data but also that investigators who wish to create new data must “demonstrate that no suitable data are available for re-use” (Economic and Social Research Council, 2010, p. 3). For grants to create new data, ESRC requires a “data management and sharing plan.” Grantees also must prepare their data for “re-use and/or archiving” with an ESRC data provider within three months of the end of the award. If the grantees have not offered the data to an appropriate provider in that time period, then the agency may withhold the final payment on the award. The ESRC funds repositories that can curate data from their grants. Notably, the ESRC also acknowledges that the repositories can enforce selection policies for the data they collect, lest these repositories become a catch-all for anyone's data, regardless of quality or potential for future reuse.
The public monies for public good argument resonates with legislators, taxpayers, and the general public. It also resonates with researchers whose data are readily reusable by those without substantial domain knowledge, such as types of astronomical or earth observations that can support citizen science. Another public interest-driven argument is that data release will minimize duplication of research effort, which, in turn, results in fewer human subjects being required to establish findings (Fischer & Zigmond, 2010).
In most research projects, data are collected by individuals or by small teams. Methods are local and are specific to the research questions at hand. Reusing these types of data requires considerable knowledge of the procedures by which they were collected, which, in turn, requires considerable expertise in the research specialty. The farther removed from the data collection activity, the harder it is to make use of someone else's data. Thus, it is not surprising that concerns for the misinterpretation and misuse of data are common reasons that researchers give for not sharing (Campbell et al., 2002; Hilgartner, 1997; Hilgartner & Brandt-Rauf, 1994).
Researchers are more willing to share their data with those in their immediate area of specialty than with the general public. Those within their community of interest, to use the NSF term, have the expertise to interpret the data and thus are most likely to benefit from access. Making data available to users beyond one's specialty requires much more documentation effort. Researchers are concerned about misuse, such as selective extraction of data points, or misinterpretation, whether because of lack of expertise, lack of documentation, or other factors. Data from observatories, where resources for documentation and curation usually are included in the research design and funding, are more readily released to the public. Observatory data are only a small subset of extant data resources, and they can be sensitive. Global and comparative research on climate change depends on open access to data (Overpeck, Meehl, Bony, & Easterling, 2011; Santer et al., 2011), yet the politicization of climate data (Costello, Maslin, Montgomery, Johnson, & Ekins, 2011; Gleick, 2011) makes researchers in these and in other fields wary of releasing their data.
To Enable Others to Ask New Questions of Extant Data
A more focused rationale is that sharing data enables others to ask new questions, whether from an individual dataset or by combining multiple sources. This framing has two strands, one for the benefit of researchers and one for the general public. Researchers have argued that open access to data encourages meta-analysis: the ability to combine data from multiple sources, times, and places to ask new questions (Whitlock, 2011). In this view, access to data is less about inspecting the findings of an individual project and more about the ability to combine data. Indeed, the greatest advantages of data sharing may be in the combination of data from multiple sources, compared or “mashed up” in innovative ways (Butler, 2006). Data are most reliably integrated when collected and processed systematically, in ways that support the standards of large communities. Common data structures, metadata formats, and ontologies help support mining and integration of multiple data sources.
The public good strand of the “ask new questions” rationale was framed most visibly by Chris Anderson, Editor-in-Chief of Wired magazine, in a special issue on data (Anderson, 2008):
The new availability of huge amounts of data, along with the statistical tools to crunch these numbers, offers a whole new way of understanding the world. Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all.
Anderson captures the public excitement about the promise of “big data” to explore new questions and to combine data from multiple sources to identify new relationships. Although a popular phrase, big data is a problematic term that obscures the complexity, quality, and expertise required for analysis and interpretation. Ethical questions also arise about the use of public data for purposes other than those for which they were intended (Boyd & Crawford, 2011).
Those in the scientific and technical computing communities who argue for data sharing are well aware of the difficulties involved in reusing others’ data. Assessing the veracity and integrity of a given dataset requires domain expertise, and that assessment depends upon the extent of the documentation available (Faniel & Jacobsen, 2010). The farther the user is from the point of data origin, the more documentation is required, the more effort is required on the part of the reuser, and the greater is the risk of misinterpretation. Scientists compute upon large datasets both for exploratory investigations and to generate new theory, which is quite the opposite of Anderson's conclusion that big data means “the end of theory.” As Edwards (2010), Rogers (1995), and many others have explained, data and theory are inseparable. Investigations are designed to test or to develop theories, and those theories are used to make sense of the data. Scholarship entails fitting data and theory.
This third rationale, to enable others to ask new questions of extant data, benefits prospective users more than producers of data. The data mashup strand of this argument falls in the upper right quadrant of Figure 3, as the intended users are a peer community of researchers, whereas data mining by anyone, for any reason, falls in the lower right quadrant, primarily benefiting the general public. Most researchers will share more readily with their peers, given the concerns for labor, interpretation, and likelihood of reuse.
To Advance the State of Research and Innovation
The rationale for sharing data that resonates with the widest array of stakeholders is that research and innovation can be advanced more effectively. This is the claimed “fourth paradigm,” that computational science constitutes a new set of methods beyond empiricism, theory, and simulation (Bell, Hey, & Szalay, 2009; Gray et al., 2005; Hey et al., 2009). The fourth paradigm claim is appealing, but tends to overreach. Wilbanks (2009) strikes a middle ground, viewing data not as a method per se, but as a rich resource for any of the empirical, theoretical, simulation, or computational paradigms.
One distinction between the “ask new questions” and “advance research” rationales is that the latter is wholly motivated by research interests. It also goes beyond asking new questions of extant data; it addresses the need for more data and for curation of existing data in ways that ensure their usefulness. Simply put, “science depends on good data” (Whitlock et al., 2010, p. 145) and “data are the main asset of economic and social research” (Economic and Social Research Council, 2010, p. 2). Assertions such as these are most common in data-intensive fields that benefit from observatories and synoptic surveys, such as astronomy and social science survey research, and fields in which comparisons across time and space are beneficial, such as some areas of biology and ecology. When data are shared quickly and openly, researchers can draw upon each other's data more readily. For example, some space-based telescope missions alert other astronomy projects when something of interest is spotted, enabling other investigators to turn their instruments toward the specified coordinates. Thus, one instrument might identify an object or event and an unrelated project might obtain follow-up observations within seconds (Drake et al., 2011).
Fischer and Zigmond (2010), writing in Science and Engineering Ethics, identify a number of ways in which data sharing advances the state of science. These include maximizing the use of data, increasing the impact of findings, progressing the state of research faster and farther, laying a broader foundation for knowledge, expanding the scope of research, and diversifying perspectives.
Data from publicly funded research are more likely to be shared than are data resulting from privately funded research, especially in cases where the research is proprietary. Academic researchers in fields where data have high monetary value and much of the research is proprietary, such as chemistry and the biosciences, are at a disadvantage in terms of access to data. Data sharing does occur within academe and between academe and industry, but to a lesser degree than in fields where most research relies on public funds (Haeussler, 2011; Lagoze & Velden, 2009a, 2009b). Establishing open data and metadata standards for chemistry has been highly contentious in comparison to other scientific fields (Murray-Rust & Rzepa, 2004). Open data structures facilitate the massing of large amounts of data that can be mined to ask new questions. In the biosciences, “the likelihood of sharing decreases with the competitive value of the requested information” for both academic and industrial researchers (Haeussler, 2011, p. 105). The humanities and social sciences have similar problems when their data sources are copyrighted materials owned by others. Scholars may be able to quote small portions of texts, but not reproduce still or moving images or mine digital resources in depth.
The argument that data sharing can advance scholarship goes beyond data release. Once made available, data can be curated in ways that add value for the research community. The notion that data curation is a means to advance science is the cornerstone for the Data Conservancy (2010), one of the consortia funded by the NSF DataNet Program (National Science Foundation, 2010c): “The Data Conservancy (DC) embraces a shared vision: scientific data curation is a means to collect, organize, validate and preserve data so that scientists can find new ways to address the grand research challenges that face society.”
The OECD principles have a similar tone: “Sharing and open access to publicly funded research data not only helps to maximise the research potential of new digital technologies and networks, but provides greater returns from the public investment in research” (Organisation for Economic Co-operation and Development, 2007, p. 10).
Advancing scholarship is a rationale that spans the interests of data producers and users, and thus is positioned in both upper quadrants of Figure 3. If data can be aggregated into a critical mass and curated in ways that make them more accessible and valuable, then those who produce the data can exploit them better, as can other data users. Researchers are more likely to share their data if they know the data will be well managed for future use. Data collected systematically to community standards for structure and content are most easily curated and compared, and researchers are most willing to share them. Many other types of valuable data do not meet these criteria, such as the beach quality and harmful algal blooms examples presented earlier. Although sharing these data also may advance scholarship, the researchers who produce those data are more likely to exchange them within their immediate areas of specialty than to release them to the general public.
Discussion and Conclusions
For the last 25 years, the need to share research data has been declared to be an urgent problem. Yet the discussion continues, policies proliferate, and evidence of data sharing is apparent in only a few research fields. Sharing research data is clearly a conundrum: an intricate and difficult problem. Acknowledging that data sharing is difficult does not mean abandoning all hope that some data will be shared with some people some of the time. The challenges are to understand which data might be shared, by whom, with whom, under what conditions, why, and to what effects. Answers to these questions will inform data policy and practice.
The complexity of the “simple statement” in the introduction to this article should by now be apparent:
If the rewards of the data deluge are to be reaped, then researchers who produce those data must share them, and do so in such a way that the data are interpretable and reusable by others.
Neither the producers of data nor the agencies that require sharing can agree on what are the data. Data take many forms, both physical and digital. They are much more than numbers in a spreadsheet: Data can be samples, software, field notes, code books, instrument calibrations, archival records, or a myriad of other information objects, none of which may stand alone. Sharing can encompass acts as varied as announcing the existence of data, posting them on a website, or contributing them to a richly curated repository. Interpretable and reusable are the most problematic terms. “Interpretable” presumes sufficient expertise to assess the integrity of the data and to grasp their meaning. It also presumes adequate documentation of the context of the data creation, processing, and provenance. “Reusable” is a standard just short of reproducibility. Considerable expertise, effort, restructuring, and proprietary software may be necessary to reuse data. Similarly problematic is the requirement to curate data in such ways that they are “independently understandable” (Consultative Committee for Space Data Systems, 2002). Although a laudable goal, it is rarely feasible in any absolute sense, any more than is reproducibility.
Disincentives to sharing research data include lack of reward or credit for sharing, the substantial amount of labor required to document data in reusable forms, concerns for misuse or misinterpretation of data, control over intellectual property, and the need to restrict access or to de-identify data on human subjects or endangered species. Perhaps the most significant challenge to data sharing is the lack of demonstrated demand for research data outside of genomics, climate science, astronomy, social science surveys, and a few other areas. Most funding agencies and review panels evaluate grant proposals on the basis of the new data to be created in support of a research endeavor. Few promote innovation through reuse of data; the ESRC of the United Kingdom is a notable exception. Hiring, tenure, promotion, and research assessment panels rarely consider data citation in their evaluations of research productivity. Until these many disincentives are addressed, data sharing is unlikely to increase substantially.
Four rationales for sharing research data are presented and are positioned in the model in Figure 3:
(1) To reproduce or to verify research
(2) To make the results of publicly funded research available to the public
(3) To enable others to ask new questions of extant data
(4) To advance the state of research and innovation
The first rationale is the strongest from a research perspective, and yet the most problematic. Questions of reproducibility are deeply intertwined with the epistemology of the research specialty. The second and third rationales are the most driven by public interests and are argued from the perspective of those who wish to use data produced by other parties. The fourth, which also serves the public, is framed in terms of benefits to data producers, and serves research, innovation, and scholarship.
The analysis presented here focuses most directly on the concerns of the researchers who produce data, in an effort to identify types of research in which data sharing is most appropriate and to identify policies and practices that may encourage data sharing. Motivations to release data depend to a large degree on the labor required, which varies both by the purposes for which data were collected (Figure 1) and the approaches to handling data (Figure 2). Data collected for observatories or for model building require structure and documentation, which makes them suitable for wider release. Instrumented data may contain automated markup that facilitates release. Hand-collected observations by individual investigators, whether ecological field studies or ethnographies, may require the most labor to document in sharable forms.
Data are more likely to be shared when the policies benefit those who produce the data. This is a simple statement of self-interest. Researchers collaborate, but they also compete for grants, for jobs, for publication venues, and for students. They must choose carefully where to spend their time and resources. Time and money spent on documenting data for use by others are resources not spent on data collection, analysis, equipment, publication fees, conference travel, writing papers and proposals, or other research necessities. Although it can be argued that good data practices benefit the originating researcher, far less documentation is required to maintain data for one's own use than to release those data to the public. Data release is costly. Even if data sharing is built into the cost of research funding, the requirement may substantially increase the cost of doing research. Data release is more effective if those data are curated in ways that make them useful to others over some long period of time. Data curation likewise is very expensive, and unlikely to be justifiable for all forms of data. Issues of selection and appraisal to determine which data are worth curating and for how long are urgent matters that require much more attention. Similarly, more needs to be known about potential uses and users of research data. One reason that researchers do not release data is that they cannot imagine who might use them (Mayernik, 2011); they lack a “recursive public,” to use Kelty's term (Kelty, 2008, p. 3).
However, these concerns beg the question of what are the data in any given investigation. Releasing the spreadsheets or statistical files associated with tables in a journal article is a much different requirement than is releasing physical samples, hand-written field notebooks, or raw observations from complex instruments. Numerical data are of little value without the software associated with the data collection, analysis, and processing technologies. That software may be proprietary or it may be crafted locally as part of the research project. In either case, the software necessary to interpret the data may not be sharable. Even if the data were produced with common tools, those data may be unreadable after a few years because of changes in hardware and software, unless they have been curated well and migrated to new technologies. Yet more problematic is the fact that many new forms of research data are not datasets that exist in bounded forms that can be curated. Rather, they are streams of observations flowing from sensor networks, telescopes, social networks, public cameras, and countless other monitoring technologies. Any such dataset that might be shared is at best a snapshot in time.
Funding agencies’ policies for sharing data acknowledge that requirements may differ by research community. Identifying appropriate policies is challenging because the “research community” does not speak with one voice. Nor does the “astronomy community,” the “biology community,” the “sociology community,” or any other specialty. Collaborative research may cross the boundaries of disciplines, specialties, universities, and countries. Learning the interests of a given community, however narrowly or broadly defined, requires close engagement and study. The social study of science dates to the mid-20th century (Latour & Woolgar, 1979; Merton, 1969, 1973), and the interest in practices associated with data has accelerated in the last decade (Borgman, 2007; Bowker, 2005; Edwards, 2010). Social science and humanities research practices have received far less attention; more studies of these also are needed (Borgman, 2009).
Initiatives such as the NSF DataNet program (National Science Foundation, 2010c) endeavor to bring researchers, librarians, archivists, data scientists, and systems developers together to understand community-driven design for data curation. Multiple, parallel studies of individual research groups and communities are under way to inform both policy and design. Our research on astronomers, as part of the DataNet program (Data Conservancy, 2010), reveals that community-driven design means selecting and organizing data to reflect specific practices (Wynholds, 2010; Wynholds et al., 2011). At one extreme, very fine details of instrument design and calibration must be associated with data. Multidimensional temporal and spatial coordinates also may be essential. At the other extreme, researchers would like to be able to explore massive repositories of data without having to know those fine details. To paraphrase one of the astronomers we interviewed, “Only about 10% of all our data has ‘eyes on.’ We rely on analytical tools to see the rest of it.” Several have expressed concern over the design of current data repositories, which may be optimized for database performance rather than for scientific inquiry. Similar insights likely exist for any field whose data may be curated; they await study and partnership.
The reasons that researchers do not share data readily are becoming better understood. However, even less is known about why researchers do share data and why they reuse data. Thus, much more research is needed about practices in fields that do share data consistently and practices in fields where data are consistently reused. Only with this knowledge in hand, coupled with a richer understanding of the array of physical and digital objects that might be considered data, can better policies, practices, services, and systems be developed to support the sharing of research data.
This article originated in an unpublished conference paper presented in China (Borgman, 2010). Ideas were developed further through presentations to the National Research Council Board on Research Data and Information and to the Santa Fe Institute. My writing of this article benefited greatly from discussions and comments on earlier drafts by the CENS Data Management team at UCLA—David Fearon, Matthew Mayernik, Alberto Pepe, Elizabeth Rolando, Ashley Sands, Katie Shilton, Jillian Wallis, and Laura Wynholds; collaborators Sharon Traweek (UCLA), Catherine van Ingen, and Catherine Marshall (Microsoft Research); the Monitoring, Modeling, and Memory research team—Paul Edwards, Thomas Finholt, Steven Jackson, Archer Batcheller, and Ayse Buyuktur (Michigan), Geoffrey Bowker (Pittsburgh), and David Ribes (Georgetown); and numerous discussions with Paul Uhlir (National Academies of Science) and Victoria Stodden (Columbia). Monica Pamela Garcia (UCLA) provided expert bibliographic assistance and fact checking. George Djorgovski (Caltech), Alyssa Goodman (Harvard), Kathryn Mika (UCLA), Alex Szalay (Johns Hopkins), and many anonymous research subjects provided examples of data production and use that are mentioned herein. Jillian Wallis provided the artwork for the figures, improving greatly upon my sketches.
Anonymous reviewers for this journal provided helpful improvements. Subsequent revisions to this article benefited from discussions with Ian Foster, Carl Kesselman, Michael Nielsen, and others at the July 2011 Workshop, Accelerating Discovery: Human-Computer Symbiosis 50 Years On, and with Joshua Greenberg, Chris Mentzel, MacKenzie Smith, John Wilbanks, and others at the November 2011 National Science Foundation (NSF) Workshop on Research and Resource Commons in Scientific Research. Thanks are due to these workshop organizers.
Research reported here is supported in part by grants from the National Science Foundation (NSF): (1) The Center for Embedded Networked Sensing (CENS) is funded by NSF Cooperative Agreement #CCR-0120778, Deborah L. Estrin, UCLA, Principal Investigator; (2) Towards a Virtual Organization for Data Cyberinfrastructure, #OCI-0750529, C.L. Borgman, UCLA, PI; G. Bowker, Santa Clara University, Co-PI; Thomas Finholt, University of Michigan, Co-PI; (3) Monitoring, Modeling & Memory: Dynamics of Data and Knowledge in Scientific Cyberinfrastructures: #0827322, P.N. Edwards, UM, PI; Co-PIs C.L. Borgman, UCLA; G. Bowker, SCU and Pittsburgh; T. Finholt, UM; S. Jackson, UM; D. Ribes, Georgetown; S.L. Star, SCU and Pittsburgh; and (4) The Data Conservancy, NSF Cooperative Agreement (DataNet) award #OCI-0830976, Sayeed Choudhury, PI, Johns Hopkins University. We also are grateful to Microsoft Technical Computing and External Research for gifts in support of this research program.