As is evident from the above discussion of the purposes and approaches to handling data, investigators (and their collaborators, students, and staff) devote massive amounts of physical and intellectual labor to collecting, managing, and analyzing their data and to publishing their results. Data are the lifeblood of research in any field, but just what are those data varies by purpose, approach, instrumentation, community, and many other local and global considerations. Some of those data may be in sharable forms, others not. Some data are of recognized value to the community, others not. Some researchers wish to share all of their data all of the time, some wish never to share any of their data, and most are willing to share some of their data some of the time. These competing perspectives, the array of data types and origins, and the variety of local circumstances all contribute to the intricacy and difficulty of sharing data.
To Reproduce or to Verify Research
Reproducibility or replication of research is viewed as “the gold standard” for science (Jasny, Chin, Chong, & Vignieri, 2011), yet it is the most problematic rationale for sharing research data. This rationale is fundamentally research driven but can also be viewed as serving the public good. Reproducing a study confirms the science, and in doing so confirms that public monies were well spent. However, the argument can be applied only to certain kinds of data and types of research, and rests upon several questionable assumptions.
Pressure is mounting to share data for the purposes of reproducing research findings. A recent special issue of Science on replication and reproducibility examines the approaches, benefits, and challenges across multiple fields (Ioannidis & Khoury, 2011; Jasny et al., 2011; Peng, 2011; Ryan, 2011; Santer, Wigley, & Taylor, 2011; Tomasello & Call, 2011). The authors encourage data sharing to increase the likelihood of replication, while acknowledging the very different methods and standards for reproducibility in each field discussed. Particularly challenging are the “omics” fields (e.g., genomics, transcriptomics, proteomics, metabolomics), in which “clinically meaningful discoveries are hidden within millions of analyses” (Ioannidis & Khoury, 2011, p. 1230). Fine distinctions are made between reproducibility, validation, utility, replication, and repeatability, each of which has distinct meaning in individual omics fields.
The Wall Street Journal, in a front-page article reviewing this special issue of Science, oversimplified the concern by stating that “reproducibility is the foundation of all modern research, the standard by which scientific claims are evaluated” (Naik, 2011). Naik's article provides multiple examples of biomedical companies spending tens or hundreds of millions of dollars to reproduce research reported in journal articles, often without success. In this view, lack of reproducibility is a flaw in science, often traceable to problems such as insufficient control groups or to the difficulty of publishing negative results. However, by the 1970s, research in the sociology of science was describing the difficulties of verifying scientific results. Harry Collins studied the role of replication in multiple scientific disputes, providing numerous examples of disagreements about validation methods (Collins, 1983). In extensive studies of gravitational waves in physics, for example, he found that some scientists believed that only the experiments that detected these waves were performed appropriately, whereas other scientists trusted only the experiments that failed to detect such waves (Collins, 1975, 1998).
Peer review rests on expert judgment rather than on reproducibility. Reviewers or referees are expected to assess the reliability and validity of a research report based on the information provided. In only a few fields do reviewers attempt to reanalyze or verify data or to reconstruct all the steps in a mathematical proof or other procedure. Even when data are included with a journal article or conference paper, rarely is enough information provided to reproduce the results. Instrument details and calibration may be omitted, or lab-specific practices may not be documented in sufficient detail. This is normal practice, both because journal space constraints discourage elaborate methods sections and because research expertise relies upon tacit knowledge that is not easily documented (Bowker, 2005; Collins, 1983; Kanfer et al., 2000; Latour, 1987; Latour & Woolgar, 1979, 1986). Yet whenever published articles are withdrawn from major journals, questions are raised about what the reviewers knew—or should have known—about the data and procedures (Brumfiel, 2002; Couzin & Unger, 2006; Couzin-Frankel, 2010; Normile, Vogel, & Couzin, 2006).
Reproducibility is a high bar, and even that bar has several levels, such as the precise duplication of observations or experiments, the exact replication of a software workflow, the degree of effort necessary, and whether proprietary tools are required (Reproducible research, 2010; Stodden, 2009a, 2009b; Vandewalle, Kovacevic, & Vetterli, 2009). Observations of the real world—such as water samples, algae, air and soil temperature, or comets—are associated with particular times and places, and thus rarely can they be reproduced. It is for this reason that observations are worth curating. Experimental observations may be reproduced in laboratory conditions, given sufficient documentation and access to the same materials, equipment, software, and technical expertise. Given the same set of observations, whether natural or experimental, other researchers may be able to replicate the results, if identical circumstances can be achieved. However, experienced researchers know that minute differences in procedures, machine calibrations, temperature and humidity near the instruments, and other factors can introduce undetected variation that influences replicability and interpretation.
Research activities themselves are difficult to replicate, because processing and analysis tools such as statistical, mathematical, and workflow software rarely maintain a precise record of the systems or the data at each transaction (Claerbout, 2010; Goble & De Roure, 2009). Questions of provenance, in both the archival sense of "chain of custody" and the computing sense of "transformations from original state," arise in interpreting data that may have passed through multiple hands and processes (Buneman, Khanna, & Tan, 2000; Gil, 2010; Hunter & Cheung, 2007).
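The provenance gap can be made concrete with a minimal sketch in Python. Everything here is hypothetical and illustrative: the function names, the log format, and the file names are inventions for this sketch, not features of any tool cited above. The idea is simply that each transformation records what ran, on which system, with which parameters, and checksums of the data before and after, so that "transformations from original state" remain auditable.

```python
import hashlib
import json
import platform
from datetime import datetime, timezone

def checksum(path):
    """Return the SHA-256 digest of a data file's contents."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def log_step(log_path, step, inputs, outputs, parameters):
    """Append one provenance record describing a single transformation."""
    record = {
        "step": step,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "system": platform.platform(),
        "python": platform.python_version(),
        "inputs": {p: checksum(p) for p in inputs},    # data before
        "outputs": {p: checksum(p) for p in outputs},  # data after
        "parameters": parameters,                      # judgment calls made
    }
    with open(log_path, "a") as log:
        log.write(json.dumps(record) + "\n")

# Hypothetical usage: record an outlier-removal step on temperature data.
# log_step("provenance.jsonl", "remove_outliers",
#          inputs=["raw_temps.csv"], outputs=["clean_temps.csv"],
#          parameters={"z_threshold": 3.0})
```

Even a log this simple captures the cleaning parameters and judgments that, as discussed below, are rarely fully documented in practice.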
Other impediments to reproducibility include difficulties in gaining access to data and to tools used to create and analyze those data, lack of a licensing regime to provide access to proprietary software and to data (Stodden, 2009b), and conflicts in copyright law, such as differences between the United States and Europe in the ability to make proprietary claims on factual matters (Reichman & Uhlir, 2003).
Reproducibility is elusive because of the vagaries of the research process, the varying notions of data, disparate community practices for documenting evidence, the transitory nature of observations, and the combinations of expertise necessary for interpretation of evidence. As a rationale for sharing research data, reproducibility is problematic not only because it applies to so few types of research, but because it risks reducing the research process to a set of mechanistic procedures. Even chemists will acknowledge that their work is as much art as it is science (Lagoze & Velden, 2009a, 2009b). True reproducibility requires deep engagement with the epistemological questions of a given research specialty, and the very different ways in which investigators obtain and value evidence. Reusers of data may not know, or be able to know, what prior actors did to the data. Each step in cleaning or processing data requires judgments, few of which may be fully documented. Later interpretations thus may depend upon multilevel inferences that are statistically problematic (Meng, 2010).
Although reproducibility is a popular term in promoting data release, the concept incorporates a wide array of interpretations, including replication, repeatability, validation, and verification. Verification, while a less precise notion, may be easier to accomplish. Peer reviewers and readers can judge whether the research methods meet community standards, whether the evidence is appropriate for the claim, and whether the arguments are reasonable, even if they are unable to reconstruct all of the procedures. Potential reusers of data also can apply their own judgments about the veracity and usefulness of the data; their criteria may be highly specific to their research topic (Faniel & Jacobsen, 2010; Wynholds et al., 2011).
Motivations to share research data for the purposes of reproducibility or verification vary widely by type and condition of the data and expectations of the community. Where data deposit is required as a condition of publication, as in the case of genomes and the Protein Data Bank discussed earlier, researchers will comply. If reproducibility requires materials that are very difficult to share, such as specialized animals or cell lines, then community practices will dictate the conditions of data release. Considerable human expertise, labor, and licensing of intellectual property may be involved. Researchers give these reasons and many others for refusing to release data (Campbell et al., 2002; Hilgartner, 1997, 1998, 2002; Hilgartner & Brandt-Rauf, 1994). If materials and documentation are highly automated, if no licensing restrictions apply, and if the researcher has completed his or her publication of the results, then data sharing is more likely to occur.
However, researchers may never be "done" with their data. In cases where a research career is based on long-term study of a specific species, locale, or set of artifacts, data become more valuable as they accumulate. Researchers in these situations may be particularly reluctant to release data associated with a specific publication, as doing so might mean releasing many years of data. Similarly, reproducing the data associated with any given publication is problematic, as the set of observations reported may depend heavily on prior studies and on interpretation of much earlier data.
Scholars may wish to verify findings either to build upon or to refute them. The generalized rationale of sharing data for reproducibility or for verification of results lies near the research-driven pole of the argument dimension in Figure 3. It spans the interests of data producers, who can benefit when others verify their findings and thereby reinforce their credibility, and the interests of researchers who would use others' data. The greater possibility of replication or verification occurs with data produced for the purposes of observatories, modeling systems, or theory building (the far end of the dimensions in Figure 1), or those collected by large teams, by technologies, and subject to machine processing (the far poles of Figure 2). These categories of data are more likely to be captured and described consistently than are data collected for exploratory investigations, to describe phenomena, for empirical studies, by individual investigators, or processed by hand. Innovation in research, by contrast, requires new methods of research design, analysis, and technology, which are less amenable to such consistent capture.
From an epistemological perspective, reproducibility and verification are the most problematic of the four arguments for sharing data. Often the research creativity lies in identifying a new method required to approach an old problem. Research outcomes often depend much more on interpretation than on the data per se. Separating data from context is a risky matter that must be balanced carefully against demands for reproducibility.
To Make Results of Publicly Funded Research Available to the Public
Public sentiment for sharing research data is based largely on the rationale that tax monies should be leveraged to serve the public good. In this view, data produced with public funds should be available for use and should not be hoarded by researchers. The public good argument is implicit in the OECD principles, to which several of the funding agency policies refer, namely, that open access to research data is a means to leverage public investment in research (Organisation for Economic Co-operation and Development, 2007). The OECD document also builds upon an earlier U.S. study, explicitly quoting this passage: “The value of data lies in their use. Full and open access to scientific data should be adopted as the international norm for the exchange of scientific data derived from publicly funded research” (National Research Council, 1997).
U.S. public policy tends toward openness of research information. Federal law waives copyright protection on data and information directly produced by government agencies, putting those materials into the public domain in the United States (Reichman & Uhlir, 2003). Data and information resulting from research grants to universities and other agencies do not fall under the same law. Data and information produced by European governments generally do not fall into the public domain, and some of their database laws are more restrictive than those of the United States (Boyle, 2004; Boyle & Jenkins, 2003). However, initiatives such as EMODNET and INSPIRE in the environmental sciences are building infrastructure to combine and mine data throughout the European Union (INSPIRE, 2007; European Marine Observation and Data Network, 2011).
The rationale of making publicly funded research available to the public applies both to data and to publications, but is playing out differently between the two. The "public monies for public goods" rationale has succeeded in the biomedical research community for the open deposit of publications, but not without resistance, especially on the part of publishers. Biomedical information has a substantial audience, including biomedical researchers, clinicians, the pharmaceutical industry, and patients, so it is not surprising that this was the first frontier for open access. Publications resulting from NIH funding must be deposited into PubMed Central within 12 months of publication, an embargo period that protects the journals (National Institutes of Health, 2005). With regard to data, however, NIH requires release only for grants over a certain size, and allows those data to be embargoed for a period of time so that the investigators can publish their findings first. The NIH does not require that the data be deposited in any particular resource, only that they be released.
The Wellcome Trust, the largest funder of biomedical research in the United Kingdom, requires that both publications and data from its grants be made available. It supports multiple types of open access publication, but does not follow the NIH model of requiring deposit in a specific repository (Wellcome Trust, 2005; Fazackerley, 2004).
The NSF policy is ambiguous with respect to what must be released. Publications are defined as a type of data within the scope of the Data Management Plan, but NSF does not specifically require that publications resulting from their grants be released openly (National Science Foundation, 2010a). In light of growing pressure for open access to publications, the situation may change (Directory of Open Access Journals, 2009; Open Content Alliance, 2009; Beaudouin-Lafon, 2010; Crow, 2009; Kaiser, 2008; Ware, 2010; Young, 2009). One result of the popularity of the NIH publication deposit requirements is a bill introduced into Congress to require federal research granting agencies to make resulting publications available to the public (Federal Research Public Access Act of 2012). In late 2011, the U.S. government solicited public input on policies for access to research data (Office of the Federal Register, 2011).
The release of data and the release of publications are coupled, both as an economic argument and as support for reproducibility, verification, and reuse. Publications are the primary form of documentation for most types of data. They explain the research problem addressed, the methods by which the data were collected, the analyses performed, and the interpretation of the results. Publications add value to data and vice versa (Borgman, 2007; Bourne, 2005; Pepe, Mayernik, Borgman, & Van de Sompel, 2010).
The Economic and Social Research Council (ESRC), the primary U.K. funding agency for economic and social research, recently issued a data policy that goes well beyond the scope of the NSF, NIH, and Wellcome Trust requirements. The ESRC requires not only the public release of data but also that investigators who wish to create new data "demonstrate that no suitable data are available for re-use" (Economic and Social Research Council, 2010, p. 3). For grants to create new data, ESRC requires a "data management and sharing plan." Grantees also must prepare their data for "re-use and/or archiving" with an ESRC data provider within three months of the end of the award. If the grantees have not offered the data to an appropriate provider in that time period, then the agency may withhold the final payment on the award. The ESRC funds repositories that can curate data from its grants. Notably, the ESRC also acknowledges that the repositories can enforce selection policies for the data they collect, lest these repositories become a catch-all for anyone's data, regardless of quality or potential for future reuse.
The public monies for public good argument resonates with legislators, taxpayers, and the general public. It also resonates with researchers whose data are readily reusable by those without substantial domain knowledge, such as types of astronomical or earth observations that can support citizen science. Another public interest-driven argument is that data release will minimize duplication of research effort, which, in turn, results in fewer human subjects being required to establish findings (Fischer & Zigmond, 2010).
In most research projects, data are collected by individuals or by small teams. Methods are local and are specific to the research questions at hand. Reusing these types of data requires considerable knowledge of the procedures by which they were collected, which, in turn, requires considerable expertise in the research specialty. The farther removed from the data collection activity, the harder it is to make use of someone else's data. Thus, it is not surprising that concerns for the misinterpretation and misuse of data are common reasons that researchers give for not sharing (Campbell et al., 2002; Hilgartner, 1997; Hilgartner & Brandt-Rauf, 1994).
Researchers are more willing to share their data with those in their immediate area of specialty than with the general public. Those within their community of interest—to use the NSF term—have the expertise to interpret the data and thus are most likely to benefit from access. Making data available to users beyond one's specialty requires much more documentation effort. Researchers are concerned about misuse, such as selective extraction of data points, and about misinterpretation, whether due to lack of expertise, lack of documentation, or other factors. Data from observatories, where resources for documentation and curation usually are included in the research design and funding, are more readily released to the public. Observatory data are only a small subset of extant data resources, however, and even they can be sensitive. Global and comparative research on climate change depends on open access to data (Overpeck, Meehl, Bony, & Easterling, 2011; Santer et al., 2011), yet the politicization of climate data (Costello, Maslin, Montgomery, Johnson, & Ekins, 2011; Gleick, 2011) makes researchers in these and in other fields wary of releasing their data.
To Enable Others to Ask New Questions of Extant Data
A more focused rationale is that sharing data enables others to ask new questions, whether from an individual dataset or by combining multiple sources. This framing has two strands, one for the benefit of researchers and one for the general public. Researchers have argued that open access to data encourages meta-analysis: the ability to combine data from multiple sources, times, and places to ask new questions (Whitlock, 2011). In this view, access to data is less about inspecting the findings of an individual project and more about the ability to combine data. Indeed, the greatest advantages of data sharing may be in the combination of data from multiple sources, compared or “mashed up” in innovative ways (Butler, 2006). Data are most reliably integrated when collected and processed systematically, in ways that support the standards of large communities. Common data structures, metadata formats, and ontologies help support mining and integration of multiple data sources.
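A small, hypothetical sketch illustrates why shared structures matter. In the Python fragment below, two invented water-monitoring datasets can be combined mechanically only because both follow the same convention for field names and units; none of the names or values come from any study cited here.

```python
import pandas as pd

# Two hypothetical monitoring datasets that follow a shared convention:
# each records site_id, an ISO 8601 timestamp, and temperature in Celsius.
site_a = pd.DataFrame({
    "site_id": ["A1", "A1"],
    "timestamp": ["2011-06-01T00:00Z", "2011-06-02T00:00Z"],
    "water_temp_c": [18.2, 18.9],
})
site_b = pd.DataFrame({
    "site_id": ["B7", "B7"],
    "timestamp": ["2011-06-01T00:00Z", "2011-06-02T00:00Z"],
    "water_temp_c": [21.4, 21.1],
})

# Because the structures agree, combining and comparing is mechanical;
# without the shared schema, each merge would need bespoke reconciliation
# of field names, units, and timestamps.
combined = pd.concat([site_a, site_b], ignore_index=True)
print(combined.groupby("timestamp")["water_temp_c"].mean())
```

Ontologies extend the same principle from the names of fields to the meanings of the concepts they record.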
The public good strand of the “ask new questions” rationale was framed most visibly by Chris Anderson, Editor-in-Chief of Wired magazine, in a special issue on data (Anderson, 2008):
The new availability of huge amounts of data, along with the statistical tools to crunch these numbers, offers a whole new way of understanding the world. Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all.
Anderson captures the public excitement about the promise of “big data” to explore new questions and to combine data from multiple sources to identify new relationships. Although a popular phrase, big data is a problematic term that obscures the complexity, quality, and expertise required for analysis and interpretation. Ethical questions also arise about the use of public data for purposes other than those for which they were intended (Boyd & Crawford, 2011).
Those in the scientific and technical computing communities who argue for data sharing are well aware of the difficulties involved in reusing others’ data. Assessing the veracity and integrity of a given dataset requires domain expertise, and that assessment depends upon the extent of the documentation available (Faniel & Jacobsen, 2010). The farther the user is from the point of data origin, the more documentation that is required, the more effort required on the part of the reuser, and the greater the risk of misinterpretation. Scientists compute upon large datasets both for exploratory investigations and to generate new theory, which is quite the opposite of Anderson's conclusion that big data means “the end of theory.” As Edwards (2010), Rogers (1995), and many others have explained, data and theory are inseparable. Investigations are designed to test or to develop theories, and those theories are used to make sense of the data. Scholarship entails fitting data and theory.
This third rationale, to enable others to ask new questions of extant data, benefits prospective users more than producers of data. The data mashup strand of this argument falls in the upper right quadrant of Figure 3, as the intended users are a peer community of researchers, whereas data mining by anyone, for any reason, falls in the lower right quadrant, primarily benefiting the general public. Most researchers will share more readily with their peers, given the concerns for labor, interpretation, and likelihood of reuse.
To Advance the State of Research and Innovation
The rationale for sharing data that resonates with the widest array of stakeholders is that research and innovation can be advanced more effectively. This is the claimed “fourth paradigm,” that computational science constitutes a new set of methods beyond empiricism, theory, and simulation (Bell, Hey, & Szalay, 2009; Gray et al., 2005; Hey et al., 2009). The fourth paradigm claim is appealing, but tends to overreach. Wilbanks (2009) strikes a middle ground, viewing data not as a method per se, but as a rich resource for any of the empirical, theoretical, simulation, or computational paradigms.
One distinction between the "ask new questions" and "advance research" rationales is that the latter is wholly motivated by research interests. It also goes beyond asking new questions of extant data; it addresses the need for more data and for curation of existing data in ways that ensure their usefulness. Simply put, "science depends on good data" (Whitlock et al., 2010, p. 145) and "data are the main asset of economic and social research" (Economic and Social Research Council, 2010, p. 2). Assertions such as these are most common in data-intensive fields that benefit from observatories and synoptic surveys, such as astronomy and social science survey research, and in fields in which comparisons across time and space are beneficial, such as some areas of biology and ecology. When data are shared quickly and openly, researchers can draw upon each other's data more readily. For example, some space-based telescope missions alert other astronomy projects when something of interest is spotted, enabling other investigators to turn their instruments toward the specified coordinates. Thus, one instrument might identify an object or event, and an unrelated project might obtain follow-up observations within seconds (Drake et al., 2011).
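The alert-and-follow-up pattern can be sketched in a few lines. The sketch below is purely illustrative: real astronomy alert networks use community standards (such as the VOEvent format) that this toy publish-subscribe broker does not implement, and all names here are invented.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Alert:
    """A transient event: where a follow-up instrument should point."""
    ra_deg: float   # right ascension of the event
    dec_deg: float  # declination of the event
    magnitude: float

class AlertBroker:
    """Fan out alerts from one survey to any subscribed follow-up projects."""
    def __init__(self) -> None:
        self._subscribers: List[Callable[[Alert], None]] = []

    def subscribe(self, handler: Callable[[Alert], None]) -> None:
        self._subscribers.append(handler)

    def publish(self, alert: Alert) -> None:
        for handler in self._subscribers:
            handler(alert)

def follow_up(alert: Alert) -> None:
    # A subscribed project would slew its own instrument to these coordinates.
    print(f"Repointing to RA={alert.ra_deg}, Dec={alert.dec_deg}")

broker = AlertBroker()
broker.subscribe(follow_up)
broker.publish(Alert(ra_deg=150.1, dec_deg=2.2, magnitude=17.5))
```

The design point is only that rapid, open sharing makes the follow-up possible: the publishing survey need not know in advance who is listening.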
Fischer and Zigmond (2010), writing in Science and Engineering Ethics, identify a number of ways in which data sharing advances the state of science. These include maximizing the use of data, increasing the impact of findings, advancing research faster and farther, laying a broader foundation for knowledge, expanding the scope of research, and diversifying perspectives.
Data from publicly funded research are more likely to be shared than are data resulting from privately funded research, especially in cases where the research is proprietary. Academic researchers in fields where data have high monetary value and much of the research is proprietary, such as chemistry and the biosciences, are at a disadvantage in terms of access to data. Data sharing does occur within academe and between academe and industry, but to a lesser degree than in fields where most research relies on public funds (Haeussler, 2011; Lagoze & Velden, 2009a, 2009b). Establishing open data and metadata standards for chemistry has been highly contentious in comparison to other scientific fields (Murray-Rust & Rzepa, 2004). Open data structures facilitate the massing of large amounts of data that can be mined to ask new questions. In the biosciences, “the likelihood of sharing decreases with the competitive value of the requested information” for both academic and industrial researchers (Haeussler, 2011, p. 105). The humanities and social sciences have similar problems when their data sources are copyrighted materials owned by others. Scholars may be able to quote small portions of texts, but not reproduce still or moving images or mine digital resources in depth.
The argument that data sharing can advance scholarship goes beyond data release. Once made available, data can be curated in ways that add value for the research community. The notion that data curation is a means to advance science is the cornerstone for the Data Conservancy (2010), one of the consortia funded by the NSF DataNet Program (National Science Foundation, 2010c): “The Data Conservancy (DC) embraces a shared vision: scientific data curation is a means to collect, organize, validate and preserve data so that scientists can find new ways to address the grand research challenges that face society.”
The OECD principles have a similar tone: “Sharing and open access to publicly funded research data not only helps to maximise the research potential of new digital technologies and networks, but provides greater returns from the public investment in research” (Organisation for Economic Co-operation and Development, 2007, p. 10).
Advancing scholarship is a rationale that spans the interests of data producers and users, and thus is positioned in both upper quadrants of Figure 3. If data can be aggregated into a critical mass and curated in ways that make them more accessible and valuable, then those who produce the data can exploit them better, as can other data users. Researchers are more likely to share their data if they know the data will be well managed for future use. Data collected systematically to community standards for structure and content are most easily curated and compared, and researchers are most willing to share them. Many other types of valuable data do not meet these criteria, such as the beach quality and harmful algal blooms examples presented earlier. Although sharing these data also may advance scholarship, the researchers who produce those data are more likely to exchange them within their immediate areas of specialty than to release them to the general public.