Data: a timely topic for ecology
The importance of detailed, accurate record keeping has long been championed in all fields of science, and the traditional physical embodiment of this emphasis on detail is the laboratory notebook. Within the famous notebooks of historical figures such as Darwin, Galileo, and Da Vinci, the reader finds both data and theory, interwoven and exhaustively described. These notebooks, combined with correspondence and publications, created a record of the development of a scientific idea and the foundation of scientific progress (e.g., Costa 2009).
We might expect that scientific records are growing richer as science becomes digitized; however, scientific data and the descriptions of those data (metadata) are often decoupled in the digital age. Data management becomes more difficult, not easier, when data are kept in spreadsheets while metadata are maintained elsewhere, whether in hand-written notebooks, electronic documents, or scraps of paper. In essence, the traditional laboratory notebook has been fractured, with data and metadata spread across multiple disconnected electronic formats and hardcopies (Butler 2005).
The need for modernization of data management practices in ecology and other disciplines has received increasing attention. An array of tools for managing data and metadata has emerged to meet this need; examples include online laboratory notebook systems and scientific workflow systems (Barseghian et al. 2010, Jones and Gries 2010) and software tools for creating metadata, such as Morpho for Ecological Metadata Language (Jones et al. 2001). An upswing in concerns about data, their management, and how to properly share and archive them may be attributed to the National Science Foundation's (NSF) new requirement that a two-page data management plan be submitted as a supplement to each proposal. Scientists are now required to describe the data they will collect, the policies they will adhere to, and how they will manage and archive those data. Of course, the NSF requirement may also be a response to the fact that managing and archiving digital data is an increasingly important task for scientists, and that scientists should think about their strategies early in a project.
There are other motivators for scientists to educate themselves about data management. Journals and publishers are now exploring requirements for data sharing as a condition of publication (Ellison 2010, Whitlock 2010). Big data and the data deluge are flooding scientists with more data than they can process (Maurer et al. 2000, Carlson 2006; Hampton et al., in press), and certainly more than they can print out and staple into their lab notebooks. Further, there are calls for more openness in science in general, especially in light of recent scientific scandals related to reproducibility (Brumfiel 2010, Lancet Editorial 2010, Nature Editorial 2010, Pennisi 2010).
Given these motivators, scientists are seeking assistance in properly managing, storing, and archiving their data. We were interested in how this need is translating into the education of future ecologists. As we transition to an era of better digital data management, are up-and-coming scientists being trained in best practices? Are they learning about data, metadata, and reproducibility? In this study, we sought answers to these questions by surveying instructors of undergraduate ecology courses at institutions likely to be training future ecology graduate students.
Focus on ecology
We chose to focus on ecology in this study for several reasons. First, ecologists have been known to resist changes to their traditional methods and training (Aronova et al. 2010). Second, ecology has a comparatively limited culture of data sharing and archiving relative to disciplines such as genomics and physics (McCain 1991, Nelson 2009, Hampton et al. 2012; Hampton et al., in press). Third, ecological data are diverse, consisting of many small, unique data sets that were collected using varied methods (NRC 1995, Bowker 2000, Michener et al. 2007, Zimmerman 2007). This lack of standardization makes data harder to interpret and integrate (Zimmerman 2008) and more costly to manage, but should not be considered sufficient reason to avoid data sharing. Ecology is increasingly participating in the digital information age, and this trend will only continue.
From the supply side, there is increasing availability of online data. The Long Term Ecological Research (LTER) Network houses over 6,000 datasets, a resource that was not available a decade ago when many of today's instructors were undergoing their training (Peters 2010, Porter 2010, Michener et al. 2011). Projects such as the National Ecological Observatory Network (NEON) and the U.S. Integrated Ocean Observing System (Baptista et al. 2008) are evidence of growing interest in large-scale data that can only be properly managed using cloud computing and databases, and that will require sophisticated computing skills. In order to participate in the future of the discipline, ecologists must be capable of documenting their data in standardized formats, creating machine-readable metadata that conform to their discipline's standards, and making their data publicly accessible (Hampton et al. 2012). Digital data manipulation, analysis, and management have become required basic research skills for all ecologists, as fundamental as writing a coherent sentence.
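As a minimal illustration of what keeping data and machine-readable metadata together can look like in practice, the following Python sketch writes a plain CSV file alongside a JSON file that describes each column. This is an assumption-laden stand-in and not an implementation of Ecological Metadata Language or any other disciplinary standard; the function name, the metadata fields, and the example seedling data are all hypothetical.

```python
import csv
import json
from pathlib import Path


def write_with_metadata(rows, columns, out_dir, stem, title):
    """Write rows to <stem>.csv and a description of them to <stem>.metadata.json.

    `columns` is a list of dicts, each with at least a "name" key; the other
    keys (e.g. "description", "unit") are free-form and purely illustrative.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    names = [c["name"] for c in columns]

    # DictWriter raises ValueError if a row has a key that is not listed in
    # `names`, so every column that reaches the file has a metadata entry.
    data_path = out_dir / f"{stem}.csv"
    with data_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=names)
        writer.writeheader()
        writer.writerows(rows)

    # The metadata file names the data file it describes, so the two stay linked.
    metadata = {
        "title": title,
        "data_file": data_path.name,
        "row_count": len(rows),
        "columns": columns,
    }
    meta_path = out_dir / f"{stem}.metadata.json"
    meta_path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return data_path, meta_path


# Hypothetical example data (not drawn from any real study).
COLUMNS = [
    {"name": "plot_id", "description": "Field plot identifier", "unit": None},
    {"name": "date", "description": "Sampling date (ISO 8601)", "unit": None},
    {"name": "seedling_count", "description": "Seedlings per plot", "unit": "count"},
]
ROWS = [
    {"plot_id": "A1", "date": "2012-05-01", "seedling_count": 14},
    {"plot_id": "A2", "date": "2012-05-01", "seedling_count": 9},
]
```

Calling write_with_metadata(ROWS, COLUMNS, "archive", "seedlings", "Hypothetical seedling counts") would produce two plain-text files that can be read without the software that created them. A real archive submission would replace the JSON file with metadata in the format the repository requires.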
Current state of data management education
There is relatively little access to training on how to produce and document data sets so that others can find, understand, and re-use them (Cook et al. 2001). Better education on these topics will go a long way towards instilling in future scientists a deeper appreciation for the value of information about data (metadata) (Michener 2006). The scientific notebook is generally a part of undergraduate education, as are the scientific method and concepts such as reproducibility, but few courses exist that are exclusively devoted to data management practices for scientists. Spreadsheets are often used as the basis of data collection and education, but this is potentially problematic since spreadsheets typically do not promote good data management practices (Jones et al. 2006). The features of spreadsheets that make them desirable for the average researcher, such as extensibility, use of formatting for organization, and embedded charts, make them undesirable for preparing data for long-term archiving and re-use. Despite these drawbacks, spreadsheets are the most commonly used software tool in undergraduate ecology programs.
Although digital data have existed for decades, management of those data is only now becoming a matter of discussion among researchers. Educational materials related to data management are also lagging. There are, however, some resources available for instructors interested in incorporating data management in their curricula. These resources are being generated primarily by institutional libraries (e.g., UC Berkeley Libraries 2011), discipline-specific organizations (e.g., education modules from The Federation of Earth Science Information Partners, 2012; wiki.esipfed.org), and large funded initiatives (e.g., education modules from DataONE, 2012; http://www.dataone.org). There are also data-based exercises being created by organizations such as the Ecological Society of America in their publication Teaching Issues and Experiments in Ecology (2012; tiee.esa.org), which integrates overarching ecological and scientific concepts with real datasets and their analysis.
An editorial in Nature (2009) summarized the problem best:
“Universities and individual disciplines need to undertake a vigorous programme of education and outreach about data. Consider, for example, that most university science students get a reasonably good grounding in statistics. But their studies rarely include anything about information management—a discipline that encompasses the entire life cycle of data, from how they are acquired and stored to how they are organized, retrieved and maintained over time. That needs to change: data management should be woven into every course in science, as one of the foundations of knowledge.”
There is evidence in the education literature that university-level students would benefit from thinking about data management and organization earlier in their science education rather than later. In fact, both the Benchmarks for Science Literacy (American Association for the Advancement of Science 1993) and the National Science Education Standards (National Committee on Science Education Standards and Assessment, National Research Council 1996) cite data collection, organization, and analysis as crucial pieces in the education of K–12 students; it logically follows that university-level students in their first or second year should be instructed on the next step after data collection, which is the proper handling of those data. Leonard (2002) stated that “Being able to make accurate observations, predictions, collect and organize data and make inferences are among the most basic of such skills … for the average citizen who is trying to participate or to survive in a technological society”. It is therefore not sufficient to merely understand how to collect data; part of understanding how science works is also being able to organize (i.e., manage) those data.
To this end, we conducted an in-depth survey of ecology instructors at US institutions likely to be training future ecologists. We queried instructors about their institutions, the ecology course they teach, their beliefs about the importance of data management education, and their personal practices related to data stewardship. Overall, we found data management education to be deficient in the institutions we surveyed. Barriers identified by instructors were primarily associated with a lack of time, but instructors also cited lack of resources, lack of knowledge, and their belief that data management is not an appropriate topic for the level of students they teach.