Much has recently been written about how open access publishing and open access to data represent a scientific revolution (e.g., see the 26 March 2013 issue of Nature, no. 7442, and Reichman et al. 2011). However, in our view, the advent of open access to massive, multiple data sets, and the opportunities created by “Big Science” (Reichman et al. 2011, Hampton et al. 2013), must not subvert the critical need to maintain the principles and practice of good science. Otherwise we will open the door to a generation of “junk science,” with a massive loss of public credibility.
Many funding bodies (such as the National Science Foundation) and publishers (e.g., the Ecological Society of America) now demand that data are published or made publicly available at the time of publication. Indeed, there are now entire journals dedicated solely to publishing data sets (e.g., see http://proj.badc.rl.ac.uk/preparde/blog/DataJournalsList). We agree with many others that it is critical that the data sets gathered from research endeavors are stored and carefully curated (e.g., Pullin and Salafsky 2010, Reichman et al. 2011), with high levels of quality assurance and quality control and with clarity regarding site location and other metadata. We also recognize that some important discoveries can be made by integrating large data sets from many systems; the global collapses of bee and frog populations are classic cases (Garibaldi et al. 2011, Hof et al. 2011). Nevertheless, we suggest that good science must be underpinned by posing good questions (which is often hard to do; see Peters 1991) and then designing and executing robust experimental studies to answer those questions (which is also often a very difficult task) (Lindenmayer and Likens 2010).
However, increasing opportunities to collect massive data sets mean that there is a distinct temptation to do science backwards: now that we have all the data, what question shall we ask? Such scientific “fishing trips” (data mining), which try to make sense of a blizzard of ecological details, can lead to important phenomena being overlooked, as occurred in vital work on the hole in the ozone layer (Shanklin 2010). Perhaps more importantly, fishing trips also can lead to results and conclusions that are just plain wrong. Indeed, in our discipline of ecology, there is an increasing number of examples where opportunities for new knowledge are missed, or where substantially flawed papers are published, in part because authors had limited or no understanding of the data sets they were using, nor any experience of the ecosystems or other entities about which they wrote.
Sadly, traditional safeguards such as the peer review system (which is clearly overwhelmed by the current deluge of submissions [Priem 2013] and by the difficulty of finding reviewers; e.g., see Lajtha and Baveye 2010) and the rebuttal system have limited capacity to rectify these problems (Banobi et al. 2011). Our extensive experience from a combined 80 years of collecting empirical data is that large data sets are often nuanced and complex, and that appropriate analysis of them requires intimate knowledge of their context and substance to avoid serious mistakes in interpretation. We therefore suggest that those intending to use large, composite open-access data sets must work in close collaboration with those responsible for gathering them. They also must formally acknowledge those who gathered the data, through co-authorship, attribution, or citation. Indeed, citation of data sets might even become a form of recognition and a metric for gauging scientific and academic success (Reichman et al. 2011).
The problems of context-free, retrofitted-question approaches have been raised by others, such as Whittaker (2010) in his exposé of some not-so-clever meta-analyses, flawed in part because the authors lacked scientific experience and ecological context. There is also the emerging issue of a generation of what we term here “parasitic” scientists, who will never be motivated to gather data because doing so takes real effort and time, and it is simply easier to use data gathered by others. The pressure to publish and the extraordinary levels of competition at universities and other institutions (Lindenmayer and Likens 2011) will continue to select for such parasitic scientists. This approach to science again has the potential to lead to context-free, junk science. More importantly, it may create massive disincentives for others to spend the considerable time and effort required to collect new data.
In conclusion, we suggest that while open access to massive data sets offers exciting and unprecedented opportunities for scientific discovery, in taking up these opportunities we must remain cognizant of the need to maintain the principles and practices of good science driven by well-developed questions. Ready access to large data sets does not mean that, without prior development of good questions, those data are necessarily useful. We need to pose questions first, and then determine which data are suited and well matched to answering them. There will, of course, be many cases and particular problems where new scientific questions will need to be posed and, correspondingly, new data will need to be gathered to find scientific answers. These new data will lead to discoveries, and, in the disciplines of ecology and the environment, such breakthroughs will be more likely, and more likely to be interpreted correctly, when researchers have direct experience with the target ecosystem.