Harmonizing, annotating and sharing data in biodiversity–ecosystem functioning research

Authors


Correspondence author. E-mail: nadrowski@uni-leipzig.de

Summary

  1. The integrative research field of biodiversity–ecosystem functioning (BEF) requires close collaboration between researchers from different disciplines working at different scales of time, space and taxonomic resolution. Data can describe anything from abiotic ecosystem components to organisms, parts of organisms, genetic information or element stocks and flows. Researchers prefer the convenience of spreadsheets for data preparation, which can lead to isolated data sets that are diverse in structure and follow diverging naming conventions.

  2. BEFdata (https://github.com/befdata/befdata) is a new, open source web platform for the upload, validation and storage of data from a formatted Excel workbook. Metadata can be downloaded in Ecological Metadata Language (EML). BEFdata allows the harmonization of naming conventions by generating category lists from the primary data, which can be reviewed and managed via the Excel workbook or directly on the platform. BEFdata provides a secure environment during ongoing analysis; project members can only access primary data from other researchers after the acceptance of a data request.

  3. Due to their generic database schema, BEFdata platforms can be used for any research domain working with tabular data. They support the compilation of coherent data sets at the level of the primary data, allowing researchers to explicitly model correlation structures across data sets for synthesis. The EML export enables efficient publishing of data in global repositories.

Introduction

In biodiversity–ecosystem functioning (BEF) research, both the predictor (biodiversity) and the dependent variables (ecosystem services and functions) represent complex concepts. The data needed to establish BEF relationships are themselves highly heterogeneous and are typically generated by collaborative, interdisciplinary research consortia assembling expertise from various disciplines ranging from molecular ecology to remote sensing (Michener & Jones 2012). The diversity of data structures and scientific disciplines poses significant challenges when merging data sets to perform overarching meta-analyses. Here, we introduce the BEFdata platform, which allows researchers to manage naming conventions between data sets and to import metadata and primary data from the same spreadsheet. It includes a transparent data sharing mechanism for cooperative research projects. We use a generic data structure to accommodate the complexity of BEF research, which makes our approach useful to other scientific disciplines.

In the following, we review the challenges of managing complex data, including (1) the heterogeneity of data structures, (2) the need to manage naming conventions at the primary data level and (3) the need for transparent data sharing mechanisms.

The transdisciplinary nature, as well as the range of spatial and temporal scales typical of BEF research, is reflected in the complexity of BEF data sets. They may describe the properties of soil layers, plant traits, occurrences of individual organisms, parts of individual organisms (possibly at the molecular level) or aggregated properties of conceptual entities such as vegetation layers or ecosystem matter pools. Additionally, the majority of data sets are human-entered and contain fewer than 1000 rows (Heidorn 2008; Lotz et al. 2012), each with a unique data structure. Researchers prefer to use spreadsheets to prepare their data for analysis (Tenopir et al. 2011), but without proper annotation, even simple data sets can be difficult to understand.

When data sets are prepared independently in each research project, it is often easier to generate new names for physical or conceptual objects than to work with names developed by other groups. Examples of such naming conventions are the codes given to plots, species names, individual IDs or categorical parameter values. Diverging naming conventions increase the effort required to harmonize data sets a posteriori. One way of promoting data harmonization is to prescribe fixed data structures that enforce the use of naming conventions. For example, the Diversity Workbench (Triebel 2012) offers validation against many different web services, including services for scientific species names, habitat types, institutions or geographic context. However, these represent only a small subset of the data resulting from BEF research. Another approach is to allow any type of data file to be uploaded but ensure that detailed metadata are included in a standard form. For example, Metacat (KNB 2010) uses the Ecological Metadata Language (EML) format (Fegraus et al. 2005). See Hernández-Ernst et al. (2008) for a review of ecological information standards.

Data requests called ‘paper proposals’ or ‘proposals’ are often used within cooperative research projects as a way to make data exchange more transparent, to help in attributing credit to data contributors and to increase trust and team spirit (Stokstad 2011). They are formulated research ideas that specify what data are needed and whose expertise should be consulted to answer a specific question. Cooperative research projects that use paper proposals include, for example, the TRY initiative (Kattge et al. 2011a), BEF-China (this article) and the Nutrient Network (Stokstad 2011). To our knowledge, there are no data management solutions that offer paper proposal mechanisms to share data sets and record data exchange.

BEFdata platform

The ‘BEFdata’ platform (Fig. 1) was developed within the Biodiversity–Ecosystem Functioning Research Unit of the German Research Foundation (BEF-China, http://www.bef-china.de, FOR 891). BEFdata is an open source web application built with Ruby on Rails (Ruby, Thomas & Heinemeier Hansson 2011) and PostgreSQL (PostgreSQL Global Development Group 2012). During upload, the data are harmonized against existing data sets at the primary data level. We use a generic data structure in that we store all primary data in a single ‘sheetcells’ table (Kattge et al. 2011b, Appendix 1). BEFdata provides an EML metadata export (Appendix 2). A detailed user manual is provided in Appendix 3. Information on setting up and managing the platform can be found online (https://github.com/befdata/befdata). BEFdata platforms are currently implemented by the BEF-China project (http://china.befdata.biow.uni-leipzig.de) and its Chinese partner projects (http://159.226.89.107) and by the FunDivEUROPE project (http://fundiv.befdata.biow.uni-leipzig.de).
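The idea of keeping all primary data in a single cell-level table can be illustrated with a minimal sketch. The schema below is a simplified assumption for illustration only: apart from the ‘sheetcells’ table name, all table names, column names and helper functions are hypothetical, SQLite stands in for PostgreSQL, and the actual schema is the one documented in Appendix 1.

```python
import sqlite3

# Hypothetical, simplified schema: one row per spreadsheet cell.
# Column names are illustrative and do not reproduce BEFdata's schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE datacolumns (
        id INTEGER PRIMARY KEY,
        dataset TEXT NOT NULL,
        header TEXT NOT NULL,
        datatype TEXT NOT NULL
    );
    CREATE TABLE sheetcells (
        id INTEGER PRIMARY KEY,
        datacolumn_id INTEGER NOT NULL REFERENCES datacolumns(id),
        row_index INTEGER NOT NULL,
        import_value TEXT
    );
""")


def store_sheet(conn, dataset, headers, datatypes, rows):
    """Store a rectangular sheet as one 'sheetcells' row per cell."""
    for position, (header, datatype) in enumerate(zip(headers, datatypes)):
        cur = conn.execute(
            "INSERT INTO datacolumns (dataset, header, datatype) VALUES (?, ?, ?)",
            (dataset, header, datatype),
        )
        column_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO sheetcells (datacolumn_id, row_index, import_value) "
            "VALUES (?, ?, ?)",
            [(column_id, i, str(row[position])) for i, row in enumerate(rows, start=1)],
        )


def read_sheet(conn, dataset):
    """Pivot the stored cells of one data set back into a list of row dicts."""
    cells = conn.execute(
        "SELECT c.row_index, d.header, c.import_value "
        "FROM sheetcells c JOIN datacolumns d ON d.id = c.datacolumn_id "
        "WHERE d.dataset = ? ORDER BY c.row_index, d.id",
        (dataset,),
    )
    rows = {}
    for row_index, header, value in cells:
        rows.setdefault(row_index, {})[header] = value
    return [rows[i] for i in sorted(rows)]


# Two data sets with different column layouts share the same two tables.
store_sheet(conn, "soil", ["plot", "ph"], ["category", "number"],
            [("P01", 5.2), ("P02", 4.8)])
store_sheet(conn, "trees", ["plot", "species", "height_m"],
            ["category", "category", "number"],
            [("P01", "Schima superba", 12.5)])
```

Because the column layout is stored as data rather than as table structure, adding a data set with new columns requires no schema change in this sketch; the cost is that rows must be reassembled by a pivot such as `read_sheet`.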

Figure 1.

Screenshots of the welcome pages of the BEF-China group (http://china.befdata.biow.uni-leipzig.de), its Chinese partner projects (http://159.226.89.107) and the FunDivEUROPE (http://fundiv.befdata.biow.uni-leipzig.de) BEFdata platforms. Data sets and paper proposals are grouped by projects (A), by user (B) and on a separate data view (C). Primary data as well as metadata are uploaded exclusively through a formatted Excel 2003 workbook (Appendix 4) to minimize user interaction with web forms. For a user manual, see Appendix 3 or the BEFdata code repository (https://github.com/befdata/befdata).

Data harmonization

BEFdata platforms use a bottom-up approach to developing naming conventions driven by the data. Primary data are uploaded from the import workbook (Appendix 4). BEFdata platforms currently support text, date, number and category data types; each type has its own validation rules. Original import values are stored in the ‘sheetcells’ table and are not altered thereafter. A separate ‘categories’ table enables adherence to naming conventions across data sets (Appendix 1). Data columns from the import data are assigned to data groups, and the upload process ensures that categories are unique within data groups. Primary data of number, date and category data types are matched to existing categories within their data group during upload (Fig. 2). Having different categories available for numeric data allows the explicit definition of missing data values. Invalid values are flagged to the user for manual checking. See the user manual in Appendix 3 for further information.
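The matching step described above can be sketched as follows. This is a simplified illustration, not the platform's validation code: the `DataGroup` class, the `classify` function and the exact-match rule after trimming whitespace are assumptions made for the example, and date handling is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class DataGroup:
    """Hypothetical data group holding the category names known so far."""
    name: str
    categories: set = field(default_factory=set)


def classify(value, datatype, group):
    """Classify one import value as 'number', 'category' or 'invalid'.

    Assumption: number columns accept anything float() can parse; every
    other value must match an existing category of the data group exactly
    (after trimming whitespace), otherwise it is flagged as invalid.
    """
    text = str(value).strip()
    if datatype == "number":
        try:
            float(text)
            return "number"
        except ValueError:
            pass  # not numeric: fall through to the category lookup
    if text in group.categories:
        return "category"
    return "invalid"


# A numeric column whose data group defines explicit missing-value codes.
height = DataGroup("tree height", {"NA", "below detection limit"})
column = ["12.5", "NA", "tall", " 7 "]
report = [classify(v, "number", height) for v in column]
```

In this example, "NA" is accepted in a numeric column because it is a known category of the data group, whereas "tall" matches neither a number nor a category and would be flagged for manual checking.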

Figure 2.

Data group and category pages of a BEFdata platform. Categories are unique within data groups, and a data group page lists all its categories (A). During data import, primary data are matched to existing categories. Each category links to its own page (B), listing all the primary data it is associated with (C), including their original import values (D). Administrators can rename or merge categories from the data group pages (E) and split categories from the category page (B). See the text and the user manual in Appendix 3 for further information on how to manage categories and data groups.

The bottom-up approach to naming conventions requires a level of data management that would not be needed when using fixed naming conventions. Categories and data groups can be browsed by members and managed by data owners and administrators (Fig. 2). All categories in a data group are listed on that data group's page. Each category also has its own page that lists all the primary data linked to the category, together with the original uploaded values.

Administrators can rename, merge and split categories on the platform (Fig. 2). Any changes are reflected in every data set that is linked to the category.
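A minimal sketch shows why such changes propagate to every data set while the original import values stay untouched: cells reference a shared category object instead of storing a copy of its name. The `Category` and `Cell` classes and the `merge` function are hypothetical names chosen for this illustration; splitting is omitted.

```python
from dataclasses import dataclass


@dataclass
class Category:
    """Hypothetical category; cells point to it rather than copying its name."""
    name: str


@dataclass
class Cell:
    dataset: str
    import_value: str   # original uploaded value, never altered
    category: Category  # current harmonized meaning of the value


def merge(cells, source, target):
    """Repoint every cell linked to `source` to `target`; return the count."""
    moved = 0
    for cell in cells:
        if cell.category is source:
            cell.category = target
            moved += 1
    return moved


pinus = Category("Pinus massoniana")
misspelt = Category("Pinus masoniana")
cells = [
    Cell("tree inventory", "Pinus massoniana", pinus),
    Cell("litter traps", "Pinus masoniana", misspelt),
]

# Merging the misspelt category affects the cell from the second data set.
moved = merge(cells, misspelt, pinus)

# Renaming is a single change that every linked cell sees.
pinus.name = "Pinus massoniana Lamb."
```

After these two operations, both cells resolve to the same renamed category, and the misspelt value is still available as the import value of the second cell.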

Data owners can edit their data sets and reassign data groups, which restarts the validation process. The data owner can also download the workbook at any point. Any invalid categories are highlighted in the downloaded file, so that missing or invalid data can be corrected in the workbook, which can then be re-uploaded.

Data sharing workflow: paper proposals

Access to data sets is restricted to the data owners. Members who would like to use data sets for analysis must submit a paper proposal, which contains a list of the data sets. Data sets can be added to a logged-in user's cart, and this collection of data sets can then be used for a paper proposal. The proposal is initially reviewed by a project board to make sure that it is novel, complementary and does not compete with other activities, and then by all data set owners listed in the proposal. Once all owners have approved the proposal, proponents gain download access to the requested data sets.
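The two-stage approval described above can be sketched as a small state model. The `Proposal` class and its method names are hypothetical, and the sketch assumes a single owner per data set for brevity; it illustrates the workflow and is not the platform's access control code.

```python
from dataclasses import dataclass, field


@dataclass
class Proposal:
    """Hypothetical paper proposal: board review first, then all owners."""
    proponent: str
    dataset_owners: dict  # data set title -> owner (one owner assumed)
    board_approved: bool = False
    owner_approvals: set = field(default_factory=set)

    def approve_by_board(self):
        self.board_approved = True

    def approve_by_owner(self, owner):
        if not self.board_approved:
            raise ValueError("the project board must review the proposal first")
        if owner not in self.dataset_owners.values():
            raise ValueError(f"{owner} owns none of the requested data sets")
        self.owner_approvals.add(owner)

    def is_accepted(self):
        """Accepted only after the board and every listed owner approved."""
        owners = set(self.dataset_owners.values())
        return self.board_approved and owners <= self.owner_approvals

    def can_download(self, user, dataset):
        """Proponents gain access to listed data sets once accepted."""
        return (
            user == self.proponent
            and dataset in self.dataset_owners
            and self.is_accepted()
        )


proposal = Proposal("alice", {"soil pH": "bob", "tree height": "carol"})
proposal.approve_by_board()
proposal.approve_by_owner("bob")
pending = proposal.can_download("alice", "soil pH")   # carol has not approved yet
proposal.approve_by_owner("carol")
granted = proposal.can_download("alice", "soil pH")
```

Access is granted for the whole proposal at once in this sketch: a single outstanding owner approval keeps every requested data set locked, which mirrors the requirement that all owners approve before download access is given.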

Discussion

BEFdata platforms allow the harmonization of both primary data and metadata for collaborating research projects. In comparison with initiatives that concentrate on managing data set metadata (for example, Metacat, KNB 2010 or BExIS, Lotz et al. 2012), the focus of BEFdata is on the primary data and specifically naming conventions within primary data. Having complex but consistent sets of primary data offers new possibilities for analysing ecosystem functioning. Current approaches to interdisciplinary synthesis in BEF research compare the regression slopes from separate analyses (Balvanera et al. 2006; Nadrowski, Wirth & Scherer-Lorenzen 2010; Maestre et al. 2012). Consistent data sets enable synthesis at the level of the primary data where the correlation structure of data points can be modelled explicitly using hierarchical modelling techniques (Ogle et al. 2007).

The categories of BEFdata platforms are not controlled vocabularies (NISO 2005). While homonyms can be resolved because categories are nested within data groups, it is not possible to specify narrower or broader terms or to flag synonyms. However, BEFdata can make the use of existing semantic tools easier: custom naming conventions are exposed on a common platform where they can be reviewed; data and metadata are stored in one relational database, enabling seamless data and metadata interrogation; and metadata can be exported in standard EML format. A logical further step is to implement data validation against existing web services or thesauri (Nadrowski et al. 2012). The possibility of using web services to exchange information between repositories will be the subject of future BEFdata development. We are additionally evaluating the integration of BEFdata platforms into Kepler workflows (Gries & Porter 2011; Pfaff, Nadrowski & Wirth 2012).

Conclusions

BEFdata platforms are communication tools that help researchers in cooperative research projects speak the same language using shared naming conventions, while retaining the convenience of working with spreadsheets. Our implementation of the paper proposal process makes data use more transparent, which can increase synergies in cooperative research programs. Global data visibility can lead to new scientific collaborations, and data can be exported in EML format. BEFdata platforms do not contain prescribed domain logic and can thus be used by any scientific domain working with tabular data. With this, we hope to make data management and reuse within cooperating research projects more efficient and enjoyable.

Current managers of BEFdata platforms have profited from the speed of installation and customization (1 to 3 days for managers inexperienced with Rails applications). They continue to profit from bug fixes and new features added to the common code repository. Initial feedback from the current users has been positive. Researchers have found it especially helpful to be able to extract automatically assembled lists of names across data sets for species or plots.

Acknowledgements

This manuscript was greatly improved by the comments of two anonymous reviewers. The authors wish to thank all the members of the BEF-China project for essential help and feedback in crafting the functionality of the BEFdata platform. K. N., M. P., D. S., K. W. and S. W. were supported by the German Research Foundation (DFG) through the BEF-China project (FOR 891, sub-project ‘Data management’) of C. W. and H. B., and S. R. was supported by the EU project FunDivEUROPE (265171, Work package 1, Task I.4 ‘Data management, data quality assessment and control’) of C. W.
