Metadata Made Easy: Develop and Use Domain‐Specific Metadata Schemes by following the dmdScheme approach

Abstract Metadata plays an essential role in the long‐term preservation, reuse, and interoperability of data. Nevertheless, creating useful metadata can be sufficiently difficult and weakly enough incentivized that many datasets may be accompanied by little or no metadata. One key challenge is, therefore, how to make metadata creation easier and more valuable. We present a solution that involves creating domain‐specific metadata schemes that are as complex as necessary and as simple as possible. These goals are achieved by co‐development between a metadata expert and the researchers (i.e., the data creators). The final product is a bespoke metadata scheme into which researchers can enter information (and validate it) via the simplest of interfaces: a web browser application and a spreadsheet. We provide the R package dmdScheme (dmdScheme: An R package for working with domain specific MetaData schemes (Version v0.9.22), 2019) for creating a template domain‐specific scheme. We describe how to create a domain‐specific scheme from this template, including the iterative co‐development process, and the simple methods for using the scheme, and simple methods for quality assessment, improvement, and validation. The process of developing a metadata scheme following the outlined approach was successful, resulting in a metadata scheme which is used for the data generated in our research group. The validation quickly identifies forgotten metadata, as well as inconsistent metadata, therefore improving the quality of the metadata. Multiple output formats are available, including XML. Making the provision of metadata easier while also ensuring high quality must be a priority for data curation initiatives. We show how both objectives are achieved by close collaboration between metadata experts and researchers to create domain‐specific schemes. 
A near‐future priority is to provide methods to interface domain‐specific schemes with general metadata schemes, such as the Ecological Metadata Language, to increase interoperability.


| INTRODUCTION
To define a kind of gold standard for data handling, Wilkinson et al. (2016) developed the FAIR data principles. These define principles to make the data Findable, Accessible, Interoperable, and Reusable and help to assess data handling workflows in regard to openness.
Widely reusable means that anyone making reasonable efforts could reuse the data and that this would be the case even if the data creator(s) are unavailable. "Anyone" includes the creator(s) of the data, other members of the creating research group, and any other researcher. Use cases include using data from previous experiments to plan new ones, reanalyzing data using different or new preprocessing or analytical approaches to either compare different methodologies (Dufour & Richard, 2019) or address new scientific aspects (e.g., the use of trait databases; Schneider et al., 2019), meta-analyses (e.g., Culina et al., 2018; Zimmerman, 2008), reproduction of studies, and use of data for teaching and training (e.g., Atenas & Havemann, 2015; Henty, 2015).
For data to be reusable, the data must be findable, and it must be clear why the data were collected, how they were generated, which datasets are which, and which variables contain what information; relationships among variables must also be specified (e.g., Gregory et al., 2019, 2020; Zimmerman, 2007, 2008). All this information should be stored in metadata; thus, metadata are essential for reuse (Gregory et al., 2020; Zimmerman, 2007). Furthermore, interoperability (the I of FAIR) requires standardized metadata schemes.
Metadata schemes have been developed to provide a standardized structure and vocabulary for metadata. Examples of these schemes are the (meta)data standard Darwin Core (Darwin Core task group, 2014, for the current version please see http://rs.tdwg.org/dwc/) and the metadata standard Ecological Metadata Language (EML) in the field of biology/ecology, or more broadly Dublin Core ('Dublin Core', 2020). Interoperability is essential for research that relies on combining different datasets and is particularly important for data-based interdisciplinary research, which very often combines data from different sources.
Given such important reasons for accompanying data with appropriate metadata, why do numerous recently published datasets not include useful metadata (Roche et al., 2015)? Metadata are only available if the producer of the data provides them. Therefore, the answer to the question of why many datasets are deposited without rich metadata is that the data creators have not prioritized creating rich metadata. There is some interest and some level of prioritization (e.g., Campbell et al. (2018) showed that early career researchers in particular participate in curating and sharing their data and metadata), but the uptake needs to be accelerated. A critical question that follows is how to motivate the creation and deposition of appropriate metadata. There are multiple possible answers; the one we focus on is that creating metadata is not easy, and creating metadata that conforms to a specific scheme is daunting and difficult for researchers. These schemes are relatively complex, as they are not specific to a research domain (see glossary for definition of "research domain") but rather cover a broad field. An example is the EML metadata scheme, which caters for earth and environmental science; for individual domains in this field, not all properties of the EML scheme might be applicable. The advantages of being applicable to a broad field of science (e.g., consistent search across a wider range of domains, standardized property names and vocabulary for metadata provision, interoperability) come at the cost of being somewhat complex and rather difficult to understand, which could represent a significant barrier to use by research scientists not working in the field of metadata development.
Our aim was to make the process of creating metadata not only easy but also useful for the researcher who created the data and, if at all possible, a rather pleasurable experience. We follow the suggestion of Poisot et al. (2019) that domain-specific metadata schemes (small and purpose-built schemes) can be part of the solution to make ecological data easier to find and reuse. The example we use to illustrate a domain-specific metadata scheme is from the research domain we term "Experimental Microbial Ecology" (hereafter EME; e.g., Worsfold et al., 2009; Pennekamp et al., 2017; Altermatt et al., 2015). We chose this domain because of our familiarity with it and the fact that the data involved can be quite complex.
Many measurements are often taken using different methods. Multiple treatments are often applied. Numerous taxa are often involved. Various steps of data processing are required to obtain analysis-ready data (e.g., see Garnier et al., 2020; Pennekamp et al., 2017) from the measured raw data. The methods used can create large amounts of data (several terabytes). Therefore, EME is a sufficiently complex domain to be used as an illustration.
In this paper, we present as a case study the experience and results of our research group in developing the EME domain-specific metadata scheme. We first used the R package dmdScheme (Krug & Petchey, 2019a) to create a template domain-specific metadata scheme and then customized the template scheme to create the EME scheme (emeScheme; Krug & Petchey, 2019b). We end with a discussion of how these domain-specific metadata schemes can be integrated into larger metadata schemes, using EML as the example.
The content of this article focuses on presenting the approach by which a domain-specific metadata scheme can be created using

| RELEVANT FEATURES OF GOOD METADATA SCHEMES
Standards: A general consideration when developing domain-specific metadata schemes must be to prevent the proliferation of a multitude of schemes, which would risk little or no interoperability among domains. To increase interoperability, each domain-specific scheme should be formally linked, as far as possible, to standardized metadata schemes. A domain-specific metadata scheme can be an easy-to-use interface to a more general and standardized metadata scheme. The approach described in this paper, the dmdScheme approach, contains infrastructure which can facilitate this. Further aspects are discussed in detail below. Three other features of domain-specific metadata schemes can increase the motivation of researchers to use them: co-development, ease of use, and data/metadata validation.
Co-development by metadata experts and researchers in respective domains ensures that the scheme can be shaped by providing input to identify essential properties to be included in the metadata, and to exclude nonessential metadata. The goal then is to create a domain-specific metadata scheme that fits that domain.
Co-development not only results in a better product, but the resulting "ownership" of these schemes by researchers is likely to increase motivation to use them, to advertise them, to provide input for further development, and to include them in teaching and training.
Easy metadata entry is highly desirable. It should not be technically difficult, and presumably the easier the better. To accomplish these design goals, we made a metadata entry system that includes only a web browser-based application and a spreadsheet. The simplicity of these interfaces should keep the additional workload for the researchers as small as possible. Moreover, these methods of metadata entry can be common across domains, meaning that it is not necessary to teach or learn a different tool for each domain.
Previously developed applications for easy metadata entry include Morpho, a data management tool for earth, environmental, and ecological scientists (https://knb.ecoinformatics.org/tools/morpho). It is no longer maintained and has not seen any activity for the last 5 years (https://github.com/NCEAS/morpho). Nevertheless, it is open source and could be developed further by any interested party. Unfortunately, we did not manage to run it, presumably due to incompatibilities with the required Java versions that we were unable to resolve. Therefore, we were not able to compare its feature set with the approach presented here.
Validation of data and metadata can help researchers increase the quality of their data and metadata, for example, by checking that variables in datasets contain the information they should and that they correspond to the stated experimental treatment and observations. Most large metadata schemes provide mechanisms for validating the metadata (e.g., EML in the R package EML). These validations mainly assess the syntactical correctness of the metadata, for example, whether all required fields are provided and whether numerical values are in the allowed range (if ranges are specified). More detailed (contextual and contentual) validation can be provided for more specific situations or for smaller domains of research, that is, for domain-specific metadata schemes.
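The syntactical checks described above (required fields present, numerical values within allowed ranges) can be illustrated with a minimal Python sketch. It is not the validation code of dmdScheme or of the R package EML; the scheme definition `SCHEME`, its property names, and the function `validate_structure` are hypothetical and serve only to show the kind of rule involved.

```python
from __future__ import annotations

from typing import Any

# Hypothetical scheme definition: property name -> constraints.
# The names and ranges are illustrative assumptions; they are not taken
# from dmdScheme, emeScheme, or EML.
SCHEME = {
    "name": {"required": True, "type": str},
    "temperature": {"required": True, "type": float, "min": 0.0, "max": 40.0},
    "comment": {"required": False, "type": str},
}


def validate_structure(record: dict[str, Any], scheme: dict[str, dict]) -> list[str]:
    """Return a list of problems; an empty list means the record is valid."""
    problems = []
    for prop, rule in scheme.items():
        value = record.get(prop)
        if value is None or value == "":
            # Missing values only matter for required properties.
            if rule["required"]:
                problems.append(f"{prop}: required value is missing")
            continue
        if not isinstance(value, rule["type"]):
            problems.append(f"{prop}: expected {rule['type'].__name__}")
            continue
        # Range checks apply only where the scheme specifies a range.
        if "min" in rule and value < rule["min"]:
            problems.append(f"{prop}: {value} is below the minimum {rule['min']}")
        if "max" in rule and value > rule["max"]:
            problems.append(f"{prop}: {value} is above the maximum {rule['max']}")
    # Properties unknown to the scheme are reported as well.
    for prop in sorted(record.keys() - scheme.keys()):
        problems.append(f"{prop}: not part of the scheme")
    return problems
```

Checks of this kind need no knowledge of the research domain, which is why they can be offered by general schemes; the contextual checks mentioned above require domain knowledge in addition.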
A domain-specific metadata scheme should aim to fulfill all four of these features. Nevertheless, in some cases it will not be possible to fulfill all of them without compromises that are unacceptable for the aim of developing specific schemes. This becomes apparent when considering a domain that needs very specific use metadata which cannot be linked to any larger metadata scheme. In this case, one should aim at linking the discovery metadata to a general scheme while keeping the use metadata unlinked. If both can be mapped, the domain-specific metadata scheme would be a frontend for providing metadata that follow a larger metadata standard, using terminology the researchers are familiar with.

| The template dmdScheme package
The R package dmdScheme (Krug & Petchey, 2019a) forms the core of developing and using domain-specific metadata schemes following the dmdScheme approach. It is normally hidden from the researcher/user of the domain-specific metadata schemes and is mainly of concern to the developer or power user of new metadata schemes.
The package contains all the base functionality needed to develop a new domain-specific metadata scheme. It includes functionality to create a spreadsheet for entering the domain-specific metadata, functionality to read the metadata from that spreadsheet, basic validation functions, export functions to xml, and the templates needed to implement the export to EML. It is important to note that the dmdScheme package itself should not be used to enter actual metadata. It only contains a template for a metadata scheme.
How to develop a new scheme and how to use the package is explained in detail in the accompanying vignette Develop and Use the dmdScheme which is included in the supplemental material of this article.
A second part of the dmdScheme approach is a repository of domain-specific schemes (Krug, 2020). Any domain-specific scheme that has been developed can be deposited here. The R package dmdScheme contains functionality to load a selected scheme from this repository and install the accompanying R package in a temporary library. This arrangement makes it possible to use the scheme not only together with the R package dmdScheme, but also in other programming languages, if so desired.

| Creating the emeScheme
The scheme emeScheme (Krug & Petchey, 2019b) was developed based on the dmdScheme (Krug & Petchey, 2019a) and is tailored for data from Experimental Microbial Ecology. The motivation to develop this metadata scheme arose from the realization that long-term storage and retrieval following the FAIR data principles require metadata and data format standards, so that the data can be found and retrieved at any later stage and reused, even within one's own research environment. Therefore, it was decided to develop a rich metadata scheme which would provide enough metadata to find the data and to reuse it.
As discussed in the Introduction section, interoperability across domains requires common cross-domain metadata schemes. The dmdScheme package already contains the basic structures to provide an export to EML xml format. One of the basic requirements for doing so is that the domain-specific metadata properties are linked to the properties of the larger scheme, which in the case of the emeScheme are the EML properties. Hence, considerations in the drafting of the emeScheme (Krug & Petchey, 2019b) and some additional constraints (i.e., only one measurement and extraction method per data file) make it possible to translate the emeScheme metadata into EML. (The export into EML is planned for the next major release of the emeScheme package.)
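The linking of domain-specific properties to the properties of a larger scheme can be pictured as a lookup table. The following Python sketch rests on assumptions: the property names on both sides of `PROPERTY_MAP` are invented for illustration and are neither the actual emeScheme properties nor a verified list of EML elements, and `translate` is a hypothetical helper rather than part of any package.

```python
from __future__ import annotations

# Hypothetical mapping from domain-specific property names to paths in a
# general scheme. Both sides are invented for illustration only.
PROPERTY_MAP = {
    "experiment_name": "dataset/title",
    "researcher": "dataset/creator",
    "species_name": "dataset/coverage/taxon",
}


def translate(
    domain_metadata: dict[str, str], property_map: dict[str, str]
) -> tuple[dict[str, str], dict[str, str]]:
    """Split metadata into a mapped part (general-scheme paths) and an unmapped rest."""
    mapped: dict[str, str] = {}
    unmapped: dict[str, str] = {}
    for prop, value in domain_metadata.items():
        if prop in property_map:
            mapped[property_map[prop]] = value
        else:
            # No counterpart in the general scheme: keep it domain-specific.
            unmapped[prop] = value
    return mapped, unmapped
```

Returning the unmapped properties separately reflects the case in which some metadata can be linked to a general scheme while the remainder stays domain-specific.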
If the larger metadata scheme is kept in mind during the development of a new dmdScheme, all the functionality of the package dmdScheme can be used as a frontend for providing metadata that are compliant with larger, more complex metadata schemes. In the same way, other large metadata schemes could be used as the framework for domain-specific metadata schemes. This would bridge the gap between domain-specific metadata schemes that are simple to understand on the one side and, on the other side, metadata schemes that are complex and difficult to understand but applicable to a large range of domains.
An open exchange between the researchers and a programmer developing the scheme was essential in turning the emeScheme into a domain-specific metadata scheme which will be used by researchers to create their metadata. Researchers were involved in the process of developing the emeScheme from the beginning. This included regular meetings to identify properties in the scheme which are missing, redundant, or not needed. Finally, the researchers were the first testers of the metadata scheme.
The process of developing the emeScheme involved the following steps:
1. Definition of objectives by researchers and developer. This included the objective of FAIR compliance, but also ease of use and validation functionality.

| Enhancing the validation
Even though the package dmdScheme already contains a validation function, the validation is generic and mainly structural. The same applies for the export to xml, which only exports to a single xml file.
Additional functionality in the emeScheme, that is, the contextual and contentual validation and the export of the metadata into one xml file per data file, is included in an accompanying emeScheme R package (Krug & Petchey, 2019b).
Validation means checking the internal consistency of the metadata and their compliance with the allowed and suggested values and types, and checking the metadata against the structure of the actual data files. This validation produces an html (see Figure 2), docx, or pdf report, which shows errors, warnings, or notes. Errors, warnings, and notes represent different levels of severity of detected faults or inconsistencies in the metadata. For example, if a value is not in the list of allowed values, it will result in an error, while if it is not in the list of suggested values, a note will be produced. The validation in the emeScheme package includes aspects not incorporated in the mainly structural and syntactical validation in the dmdScheme package. Therefore, it was necessary to write a new validation function to add the new validation rules, that is, the validation of the structural metadata which concerns the data files and their columns.
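The severity rule described above (a value outside the allowed values gives an error, a value outside the suggested values gives a note) and the comparison of the metadata with the columns of a data file can be sketched in Python as follows. This is an illustrative sketch, not the emeScheme validation function: the names `Finding`, `check_value`, and `check_columns` are hypothetical, a CSV data file is assumed, and the severities assigned to column mismatches are assumptions, as the text does not specify them.

```python
from __future__ import annotations

import csv
import io
from dataclasses import dataclass


@dataclass
class Finding:
    level: str    # "error", "warning", or "note"
    message: str


def check_value(prop, value, allowed=None, suggested=None) -> list[Finding]:
    """Error if the value is not allowed; note if it is merely not suggested."""
    if allowed is not None and value not in allowed:
        return [Finding("error", f"{prop}: '{value}' is not an allowed value")]
    if suggested is not None and value not in suggested:
        return [Finding("note", f"{prop}: '{value}' is not a suggested value")]
    return []


def check_columns(described_columns, data_file) -> list[Finding]:
    """Compare the columns described in the metadata with a CSV file's header.

    The severities chosen here (error / warning) are assumptions.
    """
    header = next(csv.reader(data_file))
    findings = []
    for col in described_columns:
        if col not in header:
            findings.append(Finding("error", f"column '{col}' is described but missing in the data file"))
    for col in header:
        if col not in described_columns:
            findings.append(Finding("warning", f"column '{col}' is in the data file but not described"))
    return findings


if __name__ == "__main__":
    # Small usage example with an in-memory data file.
    data = io.StringIO("time,density\n0,10\n")
    for finding in check_columns(["time", "species"], data):
        print(finding.level, "-", finding.message)
```

Collecting findings with a severity level, rather than stopping at the first problem, allows a single report to list everything that needs attention.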
When the validation has completed without errors, the metadata can be exported to one xml file per data file. Because the export to xml in the package dmdScheme creates a single xml file, whereas we needed one xml file per data file, a new export function was included in the accompanying R package.
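The idea of writing one xml file per data file, each combining the metadata shared by all files with the metadata of that particular file, can be sketched in Python. The element names (`metadata`, `shared`, `file`), the output file naming, and the function `export_per_data_file` are assumptions made for this illustration; they do not reproduce the xml layout written by dmdScheme or emeScheme.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET
from pathlib import Path


def export_per_data_file(shared: dict, data_files: dict, out_dir: Path) -> list[Path]:
    """Write one XML metadata file for each data file.

    `shared` holds metadata that apply to every data file (e.g. the experiment);
    `data_files` maps a data file name to the metadata specific to that file.
    Property names are used as element names and must be valid XML names.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for file_name, file_metadata in data_files.items():
        root = ET.Element("metadata", attrib={"dataFile": file_name})
        for section_name, section in (("shared", shared), ("file", file_metadata)):
            section_el = ET.SubElement(root, section_name)
            for prop, value in section.items():
                ET.SubElement(section_el, prop).text = str(value)
        # Keep the full data file name to avoid collisions (a.csv vs. a.txt).
        target = out_dir / f"{file_name}.xml"
        ET.ElementTree(root).write(str(target), encoding="utf-8", xml_declaration=True)
        written.append(target)
    return written
```

Repeating the shared metadata in every file makes each xml file self-contained, so that a data file and its metadata file can be deposited or passed on together.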
FIGURE 1 Two example sheets (Experiment and Species) in the emeScheme metadata file of the 'emeScheme' spreadsheet. The complete spreadsheet can be found in the supplemental material 'emeScheme.xlsx'.
KRUG and PETCHEY | 9179

| Using the emeScheme
The functionality in the emeScheme, and in fact in all dmdScheme-derived metadata schemes, can be accessed by any of three approaches. As the scheme (and the accompanying R package) can be uploaded to the scheme repository (Krug, 2020), it is usable from a universal web app (Krug, 2019) (Figure 3). Each time the web app is started, it reloads a list of available scheme packages (and their accompanying R packages), and these can then be used in the app.
Even though this approach is the easiest, it requires the uploading of the metadata as well as the data to the server for validation.
This might not be feasible because of confidentiality/privacy reasons or because of the large size of the data files. In this case, the app can also be launched from a local R session. The app then runs on the local computer and data never leaves the local computer.
As a third option, the emeScheme and all dmdScheme derived packages can also be used from the R command line.

Glossary
• Analysis - processing the analysis-ready data in order to address the research question.
• Analysis-ready data - data ready for analysis; may be "ready" for a limited set of analyses. An example would be abundance of each of the species in a set of communities (e.g., population dynamic data of ecological communities). (Contrast with raw data.)
• Data deposit package - a collection of data and metadata files deposited in a long-term repository. This consists of at least one data file and the rich metadata describing the data file(s) and associated information. May often contain multiple data files, each with its own metadata file.
• Discovery metadata - metadata which is useful for finding/discovering the data. This includes, for example, bibliometric metadata. It can also contain information about the species and location. In specialized repositories, this metadata can be more complex and contain more properties (e.g., GBIF (2020), which uses the EML metadata scheme) than, for example, in Zenodo (2020), which uses a general metadata scheme (DataCite, 2021). Discovery metadata should be indexed and be available to a search engine. The scheme describing this metadata is usually given by the repository.
• Domain/research domain - a grouping of, e.g., experiments, research, and/or questions addressed, whose data sets can be described using metadata following one metadata scheme which can be regarded as rich metadata. One example is "Experimental Microbial Ecology", for which the metadata scheme emeScheme (Krug & Petchey, 2019b) was developed. Fields, such as Ecology and Evolutionary Biology, contain numerous domains.
• Domain-specific metadata scheme - a metadata scheme for a domain.
• Field-specific metadata scheme - a metadata scheme general and broad enough to apply to an entire field. An example is the Ecological Metadata Language (EML) (Jones et al., 2019).

FIGURE 2 An example of the validation report. The full validation report is in the supplemental material file 'Validation_Scheme.pdf'.
• Long-term storage/preservation - the process of having data stored/preserved and accessible for the long-term (i.e., greater than 20 years envisaged). The Zenodo repository currently has plans defined for at least 20 years of operation.
• Metadata - data about data. Metadata can be as little as the name of a variable/column in a spreadsheet of data, though such limited metadata would likely not be considered rich metadata and may not make the data FAIR. Metadata can be assigned two nonexclusive aspects, namely discovery metadata and use metadata.
• Metadata scheme - a formalized description of the metadata to be included in, for example, a data deposit package, their formats, and which ones are compulsory or not. A formal scheme assists with the indexing of the metadata that is required for programmatic searching and extracting metadata and data from repositories.
• Preprocessing - the preparation of the raw data to make it analysis-ready. This should be done by a script to make the process reproducible and may use different parameters/methods which need to be adjusted based on the research question and the raw data.
• Raw data - data as provided by the measuring device. This could be images or videos taken from a camera, tables as returned from machines, or hand-written records.
• Rich metadata - defined by the Research Data Alliance (Research Data Alliance, 2017) as "data with enough accurate and relevant attributes to make it easily findable."
• Use metadata - metadata which is useful/essential to be able to (re)use the data. In its most basic form, this is information containing the column names and description of the data files. It should also contain information about the experimental layout, approach, and data. This metadata can be described either by the metadata scheme used by the repository (GBIF (2020) uses the EML metadata scheme, which includes use metadata) or as an additional metadata file as defined by, for example, a domain-specific metadata scheme. These data do not have to be indexed.

ACKNOWLEDGMENTS
We thank all members of the Predictive Ecology Group at the University of Zurich who provided input in the development of the emeScheme and functioned as guinea pigs in developing and testing this approach. Funding was provided by the SNF Project 310030_188431 and the University Research Priority Programme Global Change and Biodiversity. Finally, we thank the three anonymous reviewers whose input provided encouragement and constructive critique to improve the manuscript.

CONFLICT OF INTEREST
The authors declare that they have no conflict of interest.

OPEN RESEARCH BADGES
This article has earned an Open Materials Badge for making publicly available the digitally-shareable data necessary to reproduce the reported results. The data is available at https://doi.org/10.5281/zenodo.3894237 and https://doi.org/10.5281/zenodo.4529180.

DATA AVAILABILITY STATEMENT
The package does not use any data. The code is available as fol-