Research data infrastructures in environmental sciences—Challenges and implementation approaches focusing on the integration of software components

Research data infrastructures are evolving quickly and vary widely, for instance in how they address user requirements and use cases and in how they provide user-required information through their software architecture. In this article, we discuss challenges and provide approaches for developing software architectures and software components for research data infrastructures in the environmental sciences. Taking the GeoKur research project on harmonizing land use time series as the use case for the curation and quality assurance of environmental research data, we designed, implemented, and tested approaches and software components with particular regard to data management planning as well as provenance and quality information management. We aim to illustrate how to better meet researchers' needs and provide tightly interlinked software components.

Recent years have seen an increase in activities on Research Data Management (RDM) and in the development of research data infrastructures (RDI). Having started primarily with a focus on data provision, these activities now provide several generic as well as domain-specific components to support all phases and aspects of the research data lifecycle (Figure 1, green circle). The research data lifecycle illustrates stages of data management and describes phases from data collection to data reuse (Kowalczyk, 2017). It is used as a model to support data management guidance and is well known and accepted across different data-oriented disciplines. In the GeoKur project (https://geokur.geo.tu-dresden.de/) (Fischer et al., 2023), we took the use case of creating global land use time series data to evaluate, integrate, and test curation and quality assurance approaches along the research data lifecycle. We therefore focus on both curation-driven and use case-driven aspects. GeoKur combines the requirements, experience, and skills of data providers, data users, and software developers. Figure 1 also summarizes and groups the aspects for discovery and documentation that we identified as most relevant (outer circle in gray): when creating a harmonized high-quality land use time series, data providers tackle different challenges, for example, harmonizing the land use terms of input data (by using controlled vocabularies/ontologies) or describing the data provenance and the quality of the resulting data product. Next, when evaluating the relevance of datasets as potential input data, data users endeavor to assess how the datasets have been used in other use cases (Figure 1, usage documentation). Fischer et al. (2023) focus on the user perspective and describe their approach to analyzing the availability and accessibility of metadata and to collecting user requirements. Moreover, they developed the conceptual framework for co-developing tools and producing and using data that we build upon.
Numerous standards and initiatives have been established that support harmonization and interoperability for metadata and data, provide unique identifiers for datasets and controlled vocabularies, and offer linked data approaches to enable data integration.

FIGURE 1 Aspects of research data management along the typical phases of the research data lifecycle for the use case of global land use data. Source: https://geokur.geo.tu-dresden.de/.
Environmental research and earth system sciences are clearly among the pioneers in data-intensive sciences and in the implementation of digital research processes. This encompasses the entire research process: from observing, recording, and continuously measuring environmental phenomena with different sensor systems, to analyzing these observational data and building models and simulations to understand or predict the earth system, synthesizing data from different sources to describe and assess environmental changes, and publishing the results as scientific articles, recommendations, or interactive visualizations. Thus, a plethora of (partly open source and/or community-driven) software tools, repositories, interoperability standards, legal frameworks, and data infrastructure initiatives has been established in the field of environmental and earth system sciences as well as in the field of public and commercial spatio-temporal information (Bailo et al., 2020; Bernard et al., 2014; Coetzee et al., 2020; Dangermond & Goodchild, 2019; Huber et al., 2021; Peng et al., 2021). Efforts toward RDM and RDI for the environmental sciences have been undertaken in several initiatives, such as the ENVRI initiative (https://envri.eu/), the ENVRI-FAIR project (https://envri.eu/home-envri-fair/), the European Plate Observing System (EPOS, https://www.epos-eu.org), and ICOS (https://www.icos-cp.eu).
This article builds on numerous experiences in setting up different RDI projects and on related findings from developing software architecture models and implementing research data management components. Taking the GeoKur project on the curation and quality assurance of environmental research data as a reference, we provide concepts, approaches, pilots, and recommendations on different levels. RDM is a broad field that also covers governance, legal, ethical, and financial aspects, which will not be addressed here. Focusing on data provenance and data quality information as crucial aspects of research data management, we synthesize our findings into a conceptual sketch of typical elements of RDI, summarize related challenges, and provide a conceptual framework for a related technology stack to address upcoming issues and improve research data management for the environmental sciences.

| CORE SOFTWARE COMPONENTS AND INFORMATION RESOURCE TYPES WITHIN RESEARCH DATA INFRASTRUCTURES
RDI are diverse, as they focus on different target groups and offer (digital) research products of different resource types. However, they usually provide an aggregated and curated set of research datasets and offer consolidated services to access and manage these datasets (Magagna, Goldfarb, et al., 2020). In data-intensive research projects, researchers typically start by elaborating a plan for data management, which includes the specification of all data-related workflows, formats, quality aspects, responsibilities, and roles as well as related software aspects (see, e.g., Science Europe, 2021). To guide researchers and data managers through the creation of such Data Management Plans (DMP), several tools have been developed (Jones et al., 2020).
DMP tools, such as DMPonline (https://dmponline.dcc.ac.uk/) or RDMO (https://rdmorganiser.github.io/), facilitate the web-based management of DMP and provide essential questionnaires to support the definition of data management plans. These DMP tools can be seen as core components of research data infrastructures, supporting the planning for and the subsequent documentation of research data management.
Figure 2 provides an overview of typical software components and information resources in RDI that are helpful to researchers during several phases of the research data lifecycle. Data management systems (DMS) also serve as metadata catalogs to discover published data, and current versions also allow for the discovery of other digital information resource types (e.g., map services and processing tools). To foster FAIR data and workflows, researchers need to describe their data and workflow scripts with metadata in the DMS, ideally linking to a software code repository, such as GitLab (https://about.gitlab.com/), or to related web (processing) services for data analysis. Implementing several metadata profiles simultaneously, for example, for scientific datasets, analysis workflows, and related scientific publications or documents, fosters centralized management as well as common tagging and linking of available research results using common vocabularies.
A knowledge hub enables managing, providing, and linking domain knowledge, such as thematic ontologies, vocabularies, and registers, in research projects, in institutions, or for specific disciplines. Such a hub provides several entry points and views to the managed information. Data management systems and archives link to the hub to support meaningful descriptions of the provided data (and processing scripts), thus facilitating discovery and data integration. A knowledge hub can be implemented as a triplestore to provide access to the knowledge as linked data and to enable linking to external knowledge hubs.
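As an illustration of this machine-readable access, the following minimal sketch queries a knowledge hub's SPARQL endpoint for controlled vocabulary terms. The endpoint URL and the SKOS-based graph layout are assumptions for illustration, not the actual GeoKur hub interface.

```python
# Minimal sketch: query a knowledge hub's SPARQL endpoint for vocabulary
# terms. Endpoint URL and SKOS-based layout are assumed for illustration.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "https://example.org/knowledge-hub/sparql"  # hypothetical endpoint

sparql = SPARQLWrapper(ENDPOINT)
sparql.setQuery("""
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    SELECT ?concept ?label WHERE {
        ?concept a skos:Concept ;
                 skos:prefLabel ?label .
        FILTER (lang(?label) = "en")
    }
    LIMIT 10
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

for binding in results["results"]["bindings"]:
    print(binding["concept"]["value"], "-", binding["label"]["value"])
```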
In contrast to the knowledge hub, which mainly provides a machine-readable interface, domain-specific tools mainly comprise tools that provide user interfaces. They are developed for (domain-)specific purposes and enable users to explore or process specific metadata or data, for example, to visualize data cubes as an interactive 3D component. Domain-specific tools are mainly provided as web components linked from and to information websites, which serve as (visual) entry points.
The core components listed above can be configured for a specific research project on dedicated servers, managed by the research organization for use across all of its projects, or used as an online instance serving several organizations and projects.
FIGURE 2 Core components of RDI including software components and information resources.
In addition to the core components, RDI can include tools and scripts to foster automation, such as automatically extracting metadata from geodata, providing mass imports to facilitate publishing commonly relevant metadata, or tracking provenance information from scripts. These supporting tools, sometimes implemented by the researchers themselves, cover smaller tasks in researchers' daily routines.
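As a minimal sketch of such a supporting tool, the following snippet extracts basic discovery metadata from a geospatial raster file with the rasterio library; it is a simplified stand-in for illustration and not one of the GeoKur tools.

```python
# Sketch of a small supporting tool that extracts basic discovery metadata
# from a geospatial raster file; simplified stand-in for illustration.
import rasterio

def extract_basic_metadata(path: str) -> dict:
    """Read spatial extent, reference system, and resolution from a raster."""
    with rasterio.open(path) as ds:
        bounds = ds.bounds  # (left, bottom, right, top) in the dataset's CRS
        return {
            "bbox": [bounds.left, bounds.bottom, bounds.right, bounds.top],
            "crs": str(ds.crs),            # e.g. "EPSG:4326"
            "spatial_resolution": ds.res,  # (x, y) pixel size
            "band_count": ds.count,
            "data_types": list(ds.dtypes),
        }

# Hypothetical usage with a local GeoTIFF:
# print(extract_basic_metadata("land_use_2010.tif"))
```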
Several open-source software packages are available to implement the RDI core components specified above.
However, integrating a selection of these software components into a suitable and easy-to-use RDI serving the needs of researchers along the whole research data workflow remains very challenging. In the following sections, we address this challenge by providing integration approaches and pilot implementations. We consider data management plans, provenance information management, and quality information management as three exemplary and thematically linked topics that address specific data user and provider needs as well as ways to link and integrate software solutions.

| APPROACHES AND PILOTS TO INTEGRATE CORE SOFTWARE COMPONENTS OF RESEARCH DATA INFRASTRUCTURES
In the GeoKur project (https://geokur.geo.tu-dresden.de/) (Fischer et al., 2023), we focused on an integrated approach that considers the topic-related problems across software components and phases of the data lifecycle from the user perspective, and we developed an architecture (Figure 3) based on standards and open-source software that is described in detail in the following sections. We selected and developed specific tools with particular respect to the core components for data management, DMP tools, supporting tools, domain-specific tools, and the knowledge hub as set out in Figure 2.

FIGURE 3 Overview of the RDI software components in GeoKur. * components relevant for DMP, ° components for provenance information tracking, management, and visualization, ' components for quality information extraction, management, and visualization.

| Guidance, templates, and tools for data management plans
Data management plans (DMP) are formal and structured documents that capture all relevant information along the data lifecycle, including, for instance, specifications of data products or workflows and agreements on documentation, storage, access, or roles (Science Europe, 2021). DMP are evolving and can be seen as living documents, used and improved collaboratively along the data lifecycle (Henzen et al., 2021; Miksa et al., 2019). The purpose of a DMP varies with the administrative requirements specified by funding institutions and with the project setting, for example, regarding the outline or number of pages. Possible purposes of a DMP are:

• To provide detailed information on data processing for allocating resources from an HPC center.
• To describe data management agreements for smaller (internal) projects to better plan resources and to increase awareness of unknown aspects and practices, for example, using structured and standardized metadata to improve findability.
• To define roles and specify workflows for (complex) multi- or interdisciplinary research projects with a heterogeneous team in order to develop a common understanding of data management.
Current DMP challenges concern researcher support, in particular the lack of domain-specific guidance and DMP templates, as well as tool-based support (Miksa et al., 2022; Netscher et al., 2022).
However, in both cases, the template and the guidance, the interdisciplinary or multidisciplinary character of most research projects plays a major role: all DMP approaches should be compatible with several disciplines, which stands in contradiction to the need for disciplinary solutions and improvements.
DMP have until now typically been managed in specific DMP tools and, in some cases, made publicly available via DMP catalogs (https://dmptool.org/public_plans or https://libereurope.eu/working-group/research-data-management/plans/) or on common sharing platforms. This includes efforts in providing machine-actionable DMP (Miksa et al., 2022) and in making DMP findable, accessible, interoperable, and reusable (FAIR).
However, researchers currently need to manage relevant data management information in separate tools, for example, DMP tools, data management systems, and project management tools. Reusing information, such as the data descriptions required in most DMP templates and metadata profiles for datasets, for further DMP or for further activities in the lifecycle is rarely supported by these tools and often leads to the inefficient copying or updating of information in multiple places. Thus, researchers need more efficient tool support as well as linked tools to enable efficient management and reuse of information throughout the entire research data lifecycle.

| Fostering multidisciplinary DMP guidance
DMP are relevant for all domains, and approaches should therefore be usable in or transferable to all domains. We consequently developed a multidisciplinary DMP guidance framework, a domain-specific DMP template using standardized metadata, and a related approach for linking a DMP tool and a data management system to update and reuse information. All activities are driven by our experiences in guiding our project partners in writing DMP (see Egli et al., 2022) and by our recent activities in several multidisciplinary working groups. Our framework provides discipline-specific guidance and consequences and maps them to the five Science Europe categories for DMP by applying a pattern approach. We hereby facilitate researchers in (1) choosing specific guidance for several requirements and disciplines, (2) managing and linking guidance for other disciplines, and (3) tagging guidance with common principles, for example, FAIR or openness. Further, we follow the approach of separating guidance from DMP management tools by facilitating linkage to guidance registers, thus enabling a more efficient (centralized) guidance update across tools.
Specific guidance for the environmental sciences (ES) mainly focuses on data description, collection, documentation, and data quality. Table 1 shows two examples of such disciplinary guidance for the documentation and data quality core requirement, covering metadata elements and standards, based on our experiences, empirical studies, and a survey in the GeoKur project.
As a proof of concept for our pattern framework, we guided a team of data stewards and researchers with different disciplinary backgrounds in applying the pattern framework to their disciplines, resulting in an initial set of similarly structured disciplinary DMP guidance that fosters discovery and usage in multidisciplinary projects (Henzen et al., 2022).
The guidance patterns can be linked from within DMP templates and thus facilitate writing the DMP.However, to provide additional guidance in the DMP template and to facilitate interoperability between software components, we developed an ES-specific DMP template.
3.1.2 | Providing a structured DMP template to facilitate guidance, information reuse, and software linking

Our DMP template follows an approach of providing structured fields instead of free text to facilitate machine-actionable reuse of information and to improve guidance for DMP writers. Starting from the Science Europe core requirements, we adapted the data description section by including mandatory fields from our project-specific metadata profile for geospatial datasets (Henzen et al., 2022), which extends GeoDCAT (https://inspire.ec.europa.eu/good-practice/geodcat-ap) with respect to provenance and data quality information.
We then implemented the developed template and an export plugin to push the metadata from the DMP tool to our data management system, fostering automation and a smooth, seamless workflow during planning and subsequent phases, for example, when updating the DMP during several phases of the data lifecycle (Figure 4). The developed export plugin (Wagner et al., 2022) for the web-based DMP tool Research Data Management Organizer (RDMO) (https://rdmorganiser.github.io/) uses the CKAN API (https://docs.ckan.org/en/2.9/api/) to create or update metadata entries in our data management system CKAN (Figure 3, indicated by *).

FIGURE 4 Dataset description using the GeoDCAT properties in the section "Data description and collection or reuse of existing data" of the Science Europe template, implemented as an RDMO questionnaire (Wagner et al., 2022).
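To illustrate the mechanism, the following sketch pushes DMP-derived dataset metadata to a CKAN instance via the CKAN Action API, similar in spirit to the export plugin; the instance URL, the token handling, and the extra profile field are assumptions, and the actual plugin (Wagner et al., 2022) differs in detail.

```python
# Minimal sketch of pushing DMP-derived dataset metadata into CKAN via its
# Action API; URL, token, and profile-specific fields are assumptions.
import requests

CKAN_URL = "https://ckan.example.org"  # hypothetical CKAN instance
API_TOKEN = "..."                      # issued by the CKAN instance

def push_dataset_metadata(metadata: dict) -> dict:
    """Create a CKAN dataset entry, or update it if it already exists."""
    headers = {"Authorization": API_TOKEN}
    response = requests.post(
        f"{CKAN_URL}/api/3/action/package_create",
        json=metadata, headers=headers, timeout=30,
    )
    if response.status_code == 409:  # name already taken: update instead
        response = requests.post(
            f"{CKAN_URL}/api/3/action/package_update",
            json=metadata, headers=headers, timeout=30,
        )
    response.raise_for_status()
    return response.json()["result"]

# Example payload mixing CKAN core fields with a profile-specific extra:
dataset = {
    "name": "land-use-time-series-demo",
    "title": "Harmonized land use time series (demo)",
    "notes": "Description copied from the DMP data description section.",
    "extras": [{"key": "spatial_resolution", "value": "300 m"}],
}
# push_dataset_metadata(dataset)
```

Falling back from package_create to package_update on a name conflict keeps the call idempotent, which matches the create-or-update behavior described above.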

| Providing metadata profiles and tools
Metadata management is core to supporting the discovery and reuse of datasets, and ultimately to allowing reproducibility of research findings in the data-driven, heterogeneous, and multidisciplinary research projects that are common in the environmental sciences. Thus, ensuring the acquisition and provision of meaningful and quality-assured metadata should become an integral part of such projects. Requirements for a suitable metadata schema differ, and available metadata schemas often do not fully meet these requirements. Hence, choosing a suitable metadata schema and developing a proper metadata profile is a relevant task at the beginning of each project.
Current metadata challenges include efforts in making metadata FAIR and multidisciplinary approaches that enable researchers to synthesize scientific workflows across disciplines (Jeffery et al., 2022; Latif et al., 2021; Martin et al., 2020). By defining metadata profiles that meet the requirements of several disciplines, researchers are able to link to existing meta-information instead of creating yet another specific profile. Providing a common set of metadata fields for discovery in multidisciplinary repositories still seems to be in conflict with providing rich and detailed disciplinary metadata to facilitate reusing or reproducing findings. This is of particular interest when publishing or harvesting (project- or domain-)specific metadata into institutional or long-term repositories and tackling information loss.

TABLE 1 Examples of disciplinary ES-specific guidance for the documentation and data quality section of the Science Europe DMP template, answering question (2a) What metadata and documentation will accompany the data? (Columns: Guidance/explanation, ES-specific guidance, Potential consequences.)

Example 1, metadata elements. Guidance/explanation: Most repositories enable thematic, spatial, and temporal filtering to foster the discoverability of ES datasets and therefore require separate fields for themes/keywords and for the spatial and temporal extent. For the evaluation of a dataset's fitness for use, data quality and provenance information are essential (e.g., results of a data quality survey: 10.5281/zenodo.5562326, 10.5281/zenodo.5700952); by providing structured and linked metadata for quality and provenance, we facilitate researchers in reusing existing datasets. ES-specific guidance: For ES data, in addition to common metadata elements, for example, name and identifier, we propose providing theme/keyword, including links to domain ontologies and vocabularies (e.g., https://www.fao.org/agrovoc/), spatial resolution, extent/region, spatial reference system (e.g., the EPSG collection supported by the Open Geospatial Consortium (OGC): https://epsg.org/home.html), temporal coverage, temporal resolution, provenance, specific quality measures (e.g., provided by ISO 19157-1 and ISO 19157-3, currently under revision or development), ground truth data for quality measures, and a concept of essential variables (e.g., as proposed by NASA: https://www.earthdata.nasa.gov/learn/backgrounders/essential-variables) for cross-domain topics/quality. Potential consequences: Poor ES metadata could hamper discovery, comparison, and ultimately the evaluation of fitness for use and reuse of available data.

Example 2, metadata schemas and profiles. Guidance/explanation: From a technical perspective, well-established and well-defined schemas or profiles shall be used to foster the interoperability of repositories, for example, to support harvesting metadata from a project repository into an institutional repository or long-term archive. Potential consequences: If not applying a disciplinary metadata schema or related profile, information may be lost.
While current FAIR data approaches, such as the FAIR metrics developed in the FAIRsFAIR project (https://www.fairsfair.eu/fairsfair-data-object-assessment-metrics-request-comments), already address the description of dataset provenance, providing structured, harmonized, and detailed FAIR data quality information is still a pressing challenge (Peng et al., 2021). Moreover, comprehensive approaches considering provenance and data quality as two key aspects of metadata to facilitate discovery, the evaluation of fitness for use, and reuse are still lacking. We therefore provide linked and harmonized approaches for both aspects by taking up current approaches and software developments from the perspectives of data users and data providers.

| Implementing provenance approaches in various RDI tools
Provenance information of a dataset describes its genesis and, if the dataset is referenced in the provenance of another dataset, can also be used to describe its future use. Depending on the level of detail, this description can include processes, parameters, datasets, and actors that were part of the dataset's generation (or will be part of the dataset's future use). The provenance data model (PROV-DM) (https://www.w3.org/TR/prov-dm/) allows storing this information, the provenance graph, as semantic data and thus fosters interoperability (Moreau & Missier, 2013). In Earth System Sciences (ESS), the term provenance is often used synonymously with lineage, for example, when using the ISO 19115-1:2014 lineage elements to report provenance information. The ISO 19115-1:2014 lineage model is less structured than the W3C Recommendation PROV-DM (International Organization for Standardization, 2014). However, mapping approaches between existing provenance models exist (Closa et al., 2017; Jiang et al., 2018). ProvONE (https://purl.dataone.org/provone-v1-dev) extends PROV and allows the detailed description of computational workflows, for example, related programs, ports, and channels. This technical and workflow-system-oriented description is too detailed for use cases like our GeoKur use case on land use change, where data providers just want to describe the conceptual methods applied to process the data instead of runnable workflows. Ensuring the availability of detailed provenance information in widely accepted models (e.g., PROV-DM) is fundamental for RDI, as it fosters research data transparency, understandability, replicability, and reproducibility (Magagna, Martin, et al., 2020; Sheeba & König-Ries, 2022). It is furthermore an integral part of analyzing the fitness for use of the datasets available in the RDI. Nevertheless, structured metadata for processes, managed together with dataset metadata, is often lacking. This insufficient documentation of processes and datasets has been identified by researchers with particular interest in the reproducibility of data creation processes (Baker, 2016; Lemos et al., 2012; Ostermann et al., 2021). Moreover, researchers lack support in generating and updating provenance information directly from their analysis workflows, which hampers efforts to produce detailed provenance information. Some attempts have been made, for instance, by developing APIs (Spinuso et al., 2022). Furthermore, data users require user-friendly interactive visualization tools that facilitate the understanding and evaluation of complex provenance graphs.
In our GeoKur data management approach, we aim to support researchers in overcoming these issues (1) by providing meaningful metadata profiles for both datasets and processes and enabling linkage between the generated metadata sets, (2) by providing suitable supporting tools to create and manage proper metadata from researchers' scientific workflow scripts, fostering publication in a DMS, and (3) by incorporating this information in a knowledge hub, thus generating a queryable provenance graph that serves as input for our interactive, user-friendly, domain-specific tools facilitating the evaluation of datasets' fitness for use (Figure 3, indicated by °).

| Provenance modeling
To meet user and software needs, we implemented two linked provenance approaches: a relational database approach for our data management system and a graph database approach for our knowledge hub. Considering all available provenance information managed in an RDI as the RDI's provenance graph, we model workflows that describe the creation of a specific dataset as subsets of this provenance graph. Workflows thus serve as containers grouping several datasets and processes with a strong semantic relationship and enable the identification of specific datasets and processes as their parts. A process can only be part of one specific workflow and can be assigned input and output data. Input data, output data, and the process are modeled as an elementary provenance unit. Figure 5 shows an example of three workflows and their related provenance units (example highlighted in bold) as implemented in our data management system. In our catalog, we provide user-friendly structured metadata views: for workflows, showing all related datasets and processes; for processes, showing all related input and output datasets; and for datasets, showing input datasets. Each view enables navigating to the related dataset or process metadata view (Figure 6). However, datasets can be part of several workflows (Figure 5: dataset D4 is used in workflows W1 and W3). These relations across workflows cannot be efficiently visualized or used for navigation in a catalog, because this would require a script that requests all stored metadata from the catalog and iteratively checks for relations between the provenance units. This leads to semantic information loss when using the catalog alone. We therefore also used the PROV Ontology (PROV-O) to implement the graph database (triplestore-based) approach, which facilitates linking and querying across workflows in a semantic data graph (Figure 7; datasets are modeled as entities, processes as activities, and authors of datasets as agents).
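The following sketch illustrates how one elementary provenance unit from Figure 5 (inputs D4 and D7, process P5, output D8) could be mapped to PROV-O triples with rdflib; the base URI is hypothetical, and the project's actual mapping code may differ.

```python
# Sketch: map one elementary provenance unit (D4, D7 -> P5 -> D8) to
# PROV-O triples with rdflib; the base URI is an assumption.
from rdflib import Graph, Namespace, RDF

EX = Namespace("https://example.org/geokur/")   # hypothetical base URI
PROV = Namespace("http://www.w3.org/ns/prov#")

g = Graph()
g.bind("prov", PROV)

inputs, process, output = [EX.D4, EX.D7], EX.P5, EX.D8

g.add((process, RDF.type, PROV.Activity))
g.add((output, RDF.type, PROV.Entity))
g.add((output, PROV.wasGeneratedBy, process))
for dataset in inputs:
    g.add((dataset, RDF.type, PROV.Entity))
    g.add((process, PROV.used, dataset))
    g.add((output, PROV.wasDerivedFrom, dataset))

print(g.serialize(format="turtle"))
```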

| Provenance tracking
In our metadata management approach, we provide two options for tracking provenance as supporting tools. The straightforward option is to use the web-based CKAN user interface to fill in the specific metadata profile fields.

FIGURE 5 Relational database provenance approach applied to nine datasets (rectangles) and six processes (rhombuses) that are part of three workflows. Workflow W3 describes an elementary provenance unit with two inputs (D4 and D7), a process (P5), and an output dataset (D8). Workflows W2 and W1 each include two units that are linked by one or two datasets; for example, the units in W2 create D6 as output and use it as input for P4.
With particular respect to provenance, we provide "used" and "generated" input fields for processes to choose the proper input and output datasets. For datasets, we provide contact point fields to link PROV agents.
In the second option for tracking provenance, we leverage the CKAN API to manage metadata directly from the analysis workflows to foster automation. In GeoKur, data producers mostly implement analysis workflows using the programming language R. The R package ckanr (https://cran.r-project.org/package=ckanr) is a client for the CKAN API that provides (R-)user-friendly access to CKAN instances through API requests. To guide developers on how to use ckanr with our metadata profiles, we provide comprehensive best practice documentation (https://github.com/GeoinformationSystems/Guidance_Documents/tree/master/CKAN/api_from_r).

FIGURE 6 Metadata description for the MapSPAM workflow with links to related datasets, output, and processes, implemented in the data management system CKAN.

FIGURE 7 Semantic data graph with links across workflows after mapping the relational database to the provenance information graph (see Figure 5).
Using CKAN and the related API to track provenance is suitable for provenance at dataset-level granularity.
However, to support provenance tracking at an arbitrary level of granularity, we developed the prototypical packages provr for R (https://github.com/GeoinformationSystems/provr) and provo for Python (https://github.com/GeoinformationSystems/provo). The packages implement the PROV-O terms as classes and class methods and provide simple type checks for the method arguments; for example, the method wasDerivedFrom requires an argument of type entity. A provenance graph is created through object instantiations and class method calls. This graph can be serialized in any RDF format at any point in the script.
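Since provr and provo are prototypes whose exact interfaces are not reproduced here, the following sketch demonstrates the same style of in-script provenance tracking with the independent Python prov library instead of the provo API itself; all identifiers and the namespace are chosen for illustration.

```python
# Sketch: track provenance at sub-dataset granularity from within an
# analysis script, in the spirit of provr/provo but using the independent
# Python "prov" library; identifiers and namespace are illustrative.
from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace("ex", "https://example.org/geokur/")  # hypothetical namespace

raw = doc.entity("ex:raw_land_use_table")
harmonize = doc.activity("ex:harmonize_classes")
harmonized = doc.entity("ex:harmonized_land_use_table")
author = doc.agent("ex:data_producer")

doc.used(harmonize, raw)
doc.wasGeneratedBy(harmonized, harmonize)
doc.wasDerivedFrom(harmonized, raw)
doc.wasAttributedTo(harmonized, author)

# The provenance graph can be serialized at any point in the script,
# e.g. as PROV-N for inspection:
print(doc.get_provn())
```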

| Provenance visualization
Provenance graphs can vary in size. Users often have difficulty evaluating complex provenance graphs using catalog-based user interfaces, as these interfaces spread provenance information over several pages for dataset and process information. Triplestores, on the other hand, often do not provide user interfaces and require knowing how to create queries. Our domain-specific tool ProvViewer (https://geokur-dmp2.geo.tu-dresden.de/provViewer/) implements an interactive provenance visualization concept for entities (datasets) and activities (processes) that allows user-based interactions, such as zooming, dragging, or expanding the visualized provenance graph or parts of it.
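For illustration, the following sketch shows the kind of query a user would otherwise have to formulate by hand to trace a dataset's direct inputs in the triplestore; the endpoint and dataset URIs are hypothetical.

```python
# Sketch: hand-written SPARQL to trace a dataset's direct inputs;
# endpoint and dataset URIs are assumed for illustration.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://example.org/geokur/sparql")  # assumed endpoint
sparql.setQuery("""
    PREFIX prov: <http://www.w3.org/ns/prov#>
    SELECT ?source ?activity WHERE {
        <https://example.org/geokur/D8> prov:wasDerivedFrom ?source ;
                                        prov:wasGeneratedBy ?activity .
    }
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["source"]["value"], "via", row["activity"]["value"])
```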
With the ProvViewer, we support data users who want to evaluate provenance information for datasets but are unfamiliar with the PROV data model or with querying approaches, by enabling the viewer to be opened from CKAN. Users with semantic or coding skills can provide a link to a SPARQL endpoint or upload an RDF file and then select an entity for the initial visualization. The ProvViewer takes any PROV-O-compliant RDF graph as input (triplestore data).

3.2.2 | Modeling user-driven quality information, workflows, and related tools

Geospatial research data is increasingly used for impact research and decision-making, which demands the usage and application of high-quality methods and standards throughout all phases of the research data lifecycle (RfII, 2020). Quality information is essential for evaluating a research dataset's fitness for use, in particular when integrating and harmonizing data from diverse sources (Sun et al., 2019), as is often the case when processing ES data. However, data users still struggle to describe quality requirements, and data producers fail to provide easy-to-use and comparable quality information (Ivanova et al., 2013). Moreover, available information is often limited to describing metadata quality, that is, whether all required fields are filled correctly as a cross-cutting and hierarchical concept, and does not address the quality of the dataset itself. In the end, quality information is currently often lacking, limited to metadata quality, or not provided in a structured way. There are many reasons for this, including a lack of guidance, of usable methods, and of proper supporting tools for data producers to track or generate meaningful and structured quality information directly from the analysis workflow (cp. Magagna et al., 2020; Peng et al., 2021; Wagner & Henzen, 2022). Moreover, flexible and easy-to-use schemas for quality descriptions that meet data users' needs could be used to foster the provision of user-friendly user interfaces offering harmonized and structured FAIR quality information.
In GeoKur, we used interviews to collect quality information requirements from our project partners, who are data producers and/or data users. Based on these specific needs for our GeoKur use case (Egli et al., 2022), we conducted a survey to systematically collect user needs from the ES community (Egli et al., 2021). Summarizing the results, there is an overall need for detailed, structured, and visualized quality information to better evaluate and compare the fitness for use of datasets, as quality information is often described in related scientific publications or on websites, but not in the metadata, and differs in granularity. Here, the term structured describes metadata that is provided in separate fields/properties instead of sentences, and the visualization aspect addresses the usage of suitable graphs or interactive tables (with filters). Moreover, data producers require efficient structures to manage metadata and to manage and reuse quality measure descriptions.
Based on these needs, we developed (1) a QA workflow to track and evaluate quality information along the data lifecycle, including maturity aspects, (2) a generic linked data approach for dataset-specific modeling of data quality metadata and a quality measure register, (3) a supporting tool to extract quality information from geospatial datasets, which can be used in the QA workflow and to obtain quality information for required metadata profile fields, and (4) a geospatial dashboard that visualizes quality and linked provenance meta-information (Figure 3, indicated by ').

| Approach for an ES-specific QA workflow
Our quality assurance workflow is implemented along the research data lifecycle and links aspects such as openness, FAIRness, data quality, and data maturity, understood as the degree of formalization and standardization of data, as used in Höck et al. (2020) (Wagner & Henzen, 2022). We reused and defined quality, openness, and maturity levels to better describe and monitor the current status of a dataset within the phases of the research data lifecycle. By doing so, we link our DMP and data quality approaches, as these aspects are typically described in DMP. Moreover, we build on this strong conceptual link by providing the same software component, here RDMO, for DMP and QA tasks with the same mechanisms for related questions, checklists, and tasks, supporting users through familiar usage of the component and reducing the number of components needed. We can furthermore reuse and adapt the linking approach between the DMP tool and the DMS to perform metadata updates from the QA workflow, which fosters the seamless integration of components from a user perspective.
To foster automation, in particular to collect quality information that is needed in the QA workflow but currently not available for potentially suitable datasets, we developed a supporting tool for geospatial datasets (https://github.com/GeoinformationSystems/MetadataFromGeodata) that extracts metadata from several geospatial file types (Wagner et al., 2021). Depending on the file format of the input, the tool extracts several data quality metrics compliant with ISO 19157:2013 (International Organization for Standardization, 2013).
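The following sketch illustrates the principle with a single, simple ISO 19157-style completeness measure, the share of nodata cells in a raster band; the actual MetadataFromGeodata tool computes a broader, format-dependent set of metrics.

```python
# Illustrative sketch: derive a simple ISO 19157-style completeness measure
# (share of nodata cells) from a raster; not the MetadataFromGeodata tool.
import numpy as np
import rasterio

def nodata_share(path: str, band: int = 1) -> float:
    """Fraction of cells flagged as nodata in the given band."""
    with rasterio.open(path) as ds:
        data = ds.read(band, masked=True)  # nodata cells become masked
        return float(np.ma.count_masked(data)) / data.size

# print(nodata_share("land_use_2010.tif"))
```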

| A software architecture to implement a generic quality register
One of the findings of the previously mentioned survey was that data users and producers are aware of the availability of qualified and community-approved data quality indicators to describe geodata quality. However, a recent review of the scientific literature found 110 unique data quality classifications (Haug, 2021). Moreover, the quality indicators and classifications that are accepted and used also vary within the ES discipline, for example, leading to a semantic gap between data producers and data users (Yang et al., 2013; Zhang et al., 2019). To foster a common understanding of data quality indicators among scientists, we developed a data quality register, initially filled with a subset of the ISO 19157:2013 (International Organization for Standardization, 2013) quality indicators, as part of the project's knowledge hub. Users can propose new entries to the register's curator, who is responsible for reviewing and publishing the indicator descriptions. The register is implemented as part of the knowledge hub and uses the Data Quality Vocabulary (DQV) to model quality indicators as semantic data. DQV is also applied in GeoDCAT, which we used for our metadata profile. DQV describes quality indicators as DQV dimensions, which can be grouped into DQV categories. A given DQV dimension, describing what is measured, can be linked to specific DQV metrics, describing how it is measured. For instance, the dimension Thematic Classification Correctness (https://geokur-dmp.geo.tu-dresden.de/quality-register#thematicClassificationCorrectness), which is related to the category Thematic Accuracy, can be measured with the metrics Thematic Classification Correctness as Number of Incorrectly Classified Features or Thematic Classification Correctness as Kappa Coefficient (https://geokur-dmp.geo.tu-dresden.de/quality-register#thematicClassificationCorrectnessAsKappaCoefficient). Moreover, we developed an interactive domain-specific tool (https://geokur-dmp.geo.tu-dresden.de/quality-register) for the exploration of the quality metrics, dimensions, categories, and relations managed in the register. The graph-based visualization is automatically generated and updated when the register is updated and provides hovering, zooming, and dragging as well as functionalities for expanding and collapsing nodes, in our case the metrics, dimensions, and categories (Figure 9). The visualization tool is integrated into the CKAN DMS, ensuring efficient user navigation between the quality register and the linked dataset metadata.
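The following sketch shows how one register entry could be modeled with standard DQV terms in rdflib, using the dimension, category, and metric named above; the category identifier and the exact triples in the GeoKur register are assumptions.

```python
# Sketch: model one quality register entry with the W3C Data Quality
# Vocabulary (DQV); predicates are standard DQV, the category identifier
# is assumed.
from rdflib import Graph, Namespace, RDF, Literal
from rdflib.namespace import SKOS

DQV = Namespace("http://www.w3.org/ns/dqv#")
REG = Namespace("https://geokur-dmp.geo.tu-dresden.de/quality-register#")

g = Graph()
g.bind("dqv", DQV)

kappa = REG.thematicClassificationCorrectnessAsKappaCoefficient

g.add((REG.thematicAccuracy, RDF.type, DQV.Category))
g.add((REG.thematicClassificationCorrectness, RDF.type, DQV.Dimension))
g.add((REG.thematicClassificationCorrectness, DQV.inCategory, REG.thematicAccuracy))
g.add((kappa, RDF.type, DQV.Metric))
g.add((kappa, DQV.inDimension, REG.thematicClassificationCorrectness))
g.add((kappa, SKOS.definition,
       Literal("Thematic classification correctness as Kappa coefficient",
               lang="en")))

print(g.serialize(format="turtle"))
```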

| A linked data approach for dataset-specific modeling of data quality metadata
If data producers want to use the register, they can look for a suitable quality metric to describe their dataset's quality characteristics in the visualization client and then describe the quality measurement by linking from the dataset's metadata to the metric's entry in the register. However, in expert interviews and a survey (10.5281/zenodo.7379019) within our project, data users and data producers expressed the need to describe a data quality measurement in greater detail.
Taking the use case of land cover and land use for small study areas, the experts require spatially, temporally, and thematically explicit quality information (that is, representativity); they want to evaluate reference datasets or ground truth datasets, in the best case geospatial quality datasets, instead of textual descriptions. The experts need at least information on where to find more details on the quality metrics and measurements. To achieve this, we extended the DQV data quality measurement with the following additional but optional properties:

1. An option to provide a description of the statistical confidence of a measured value (e.g., standard deviation). Such a confidence description is modeled as a DQV quality measurement of a statistical quality metric.
2. An option to link to a ground truth dataset against which the data quality measurement was performed (not only a textual description).
3. An option to describe the source of the quality information, which includes fields for a link to the source, the name of the source, and the type of source.
4. An option to define a spatial subset of the dataset's spatial extent for which the quality measurement is valid. For instance, when the dataset's spatial extent is defined as the extent of South America, the validity of a data quality measurement's results could be limited to the extent of Brazil.
5. An option to describe a temporal subset of the dataset's temporal coverage for which the quality measurement is valid. For instance, when the dataset's temporal coverage is defined as the years from 1990 to 2010, the validity of a data quality measurement's results could be limited to the years from 2002 to 2007.
6. An option to define a thematic subset of the dataset's thematic scope for which the quality measurement is valid. For instance, when the dataset's thematic scope includes several land cover classes, the validity of a data quality measurement's result could be limited to one of these classes. In the best case, this should be defined by linking to a proper vocabulary.
With properties (4) to (6), we enable data providers to describe an inhomogeneous distribution of a dataset's data quality by applying the same metric multiple times with different spatial, temporal, and/or thematic constraints (Rümmler et al., 2022).
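As a sketch of such a constrained measurement, the following example attaches a temporally restricted Kappa coefficient measurement to a dataset with standard DQV terms; the gk: constraint properties are hypothetical stand-ins, as the names of the actual extension properties are not reproduced here.

```python
# Sketch: a DQV quality measurement restricted to a temporal subset, per
# extension (5) above; dqv:* terms are standard, gk:* terms are
# hypothetical stand-ins for the project's extension properties.
from rdflib import Graph, Namespace, RDF, Literal, XSD

DQV = Namespace("http://www.w3.org/ns/dqv#")
REG = Namespace("https://geokur-dmp.geo.tu-dresden.de/quality-register#")
GK = Namespace("https://example.org/geokur-dq#")  # assumed extension namespace
EX = Namespace("https://example.org/datasets/")

g = Graph()
measurement = EX.landcover_kappa_2002_2007

g.add((EX.landcover, DQV.hasQualityMeasurement, measurement))
g.add((measurement, RDF.type, DQV.QualityMeasurement))
g.add((measurement, DQV.isMeasurementOf,
       REG.thematicClassificationCorrectnessAsKappaCoefficient))
g.add((measurement, DQV.value, Literal("0.81", datatype=XSD.double)))
# Measurement only valid for part of the dataset's temporal coverage:
g.add((measurement, GK.temporalScopeStart, Literal("2002", datatype=XSD.gYear)))
g.add((measurement, GK.temporalScopeEnd, Literal("2007", datatype=XSD.gYear)))

print(g.serialize(format="turtle"))
```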
Providing detailed descriptions of data quality directly in the dataset's metadata holds benefits for data producers and users. By making data quality accessible as part of the metadata, we enable data producers to better highlight the quality of their datasets and to prevent misuse. Data users benefit from detailed and structured quality information in metadata by having the opportunity to include data quality requirements in their search queries when looking for suitable data in a DMS. Moreover, the approach serves as the basis for developing user-friendly domain-specific tools, such as dashboards, that allow for an integrated in-depth analysis of metadata.

FIGURE 9 Graph-based visualization of our quality register content. Quality metrics are highlighted in green, dimensions in orange, and categories in blue. As a first set of quality metrics, we used the ISO 19157 metrics and added metrics for our GeoKur use case. The blue circles can be expanded by clicking on them.

| A geodashboard to explore data provenance and data quality
The visualization of structured and harmonized metadata is crucial to support the evaluation and reusability of geospatial datasets (Ślusarski & Jurkiewicz, 2020). Although approaches to visualize quality and provenance information already exist, they lack solutions that harmonize or link the specific metadata visualizations to facilitate the evaluation of geospatial data. Here, geodashboards allow for a quick and efficient overview of complex information and thus support decision-making (Bernasconi & Grandi, 2021; Grandi & Bernasconi, 2021). We therefore developed a geodashboard (https://github.com/GeoinformationSystems/Geodashboard) as a domain-specific tool that follows the user needs described in Section 3.2.1 and combines properly linked visualizations for general metadata, provenance information, quality information, and geospatial data (Figure 10). By doing so, we support users in answering complex questions across provenance, data quality, and datasets, for example: (1) How does a certain dataset (and process) influence a dataset's quality for a certain metric? Or (2) when comparing datasets, which of the input datasets provides better data related to a certain metric?
The Geodashboard includes an interactive provenance graph visualization for a specific dataset (Figure 10, No. 1), a tabular view with general metadata for a dataset selected in the provenance graph (No. 2), advanced matrix or chart views for selected quality metrics and datasets (No. 3), and a map view (No. 4) (Figgemeier et al., 2021). Based on a linked approach, the content of each of the four widgets dynamically adapts to the user's selection. Figure 10 shows our Geodashboard visualizing the MapSPAM example (see Section 3.2.1): the table (No. 2) shows general metadata for the dataset "GPW Gridded Population of the World v4.0", together with the related quality information (No. 3) of the selected datasets and the data itself (No. 4). The visualization as a heat matrix facilitates the comparison of the quality measures related to the different input datasets of the MapSPAM model. The color scales are initially set by the Geodashboard and can be changed by the users.
FIGURE 10 User interface of the Geodashboard with the four different widgets: provenance graph (1), metadata (2), quality information (3), and the map representation (4).

Our Geodashboard also serves as an alternative, more sophisticated user interface component for CKAN, providing a tabular view for each dataset's metadata entry. It is implemented with open-source libraries, for example, Plotly (https://plotly.com/javascript/) for the charts and Leaflet.js (https://leafletjs.com/) for the map.
Moreover, we use the GeoKur knowledge hub (the triplestore) as an information source and reuse the source code of the ProvViewer (see Section 3.2.1).

| DISCUSSION AND CONCLUSION
Taking data management planning and the management of provenance and data quality information as exemplary research data management activities and as crucial components of RDI, we have provided several approaches for integrating existing software components to best fit researchers' requirements and workflows. Taking the GeoKur use case, we first gathered specific user needs related to these activities and then demonstrated how to integrate selected software components that meet those needs.
Envisioning the long-term perspective for RDM, there is still a lack of overarching strategies and of mechanisms for selecting from the plethora of existing software components so as to (a) foster the reuse and sustainability of these components and (b) provide adequate services for researchers to manage their (digital) research results and to best support the respective research workflows. In support of future work toward such strategies and frameworks, we synthesize our conclusions in the following set of recommendations:

1. Addressing researchers' needs: Often claimed, yet easily forgotten in the course of implementations.
Domain expertise, the provision of well-accepted user stories, careful modeling of the researchers' processes, accommodating the different RDM expertise, domain habits, and vocabularies of specific research groups, regular user consultation and prioritization of user requirements, as well as providing user-friendly and intuitive tools supported by usability studies are crucial aspects of successful RDM and RDI developments and require adequate resources. Such usability studies should not only consider one specific RDM tool but should always evaluate usability along every step of a workflow, for example, from data management planning to data exploration, data and metadata management, and data publishing, such that the overall consistency of the different user interfaces involved (e.g., concerning user guidance and terminology) is assured.
2. Support the uptake of RDM into researchers' routines: Once RDM activities become closely integrated into the workflows of researchers and land on their desks every day, they will become part of their routine and help to achieve the FAIR principles. From a technical perspective, this requires, for instance, the close integration of metadata, quality, and provenance acquisition into the researchers' analysis tools, ideally in an automated manner.
Even though several approaches address this issue, it is still far from common practice, and thus compiling metadata that allows data reuse too often remains a tedious task at the end of a research study (if it happens at all), once the research results are published. These issues should be addressed already in the phase of data management planning (see Section 3.1), in such a way that study-specific metadata acquisition mechanisms are available right from the beginning of a study. These should allow for pragmatic, or at least semi-automatic, metadata acquisition via APIs and appropriate ETL support tools instead of web forms requiring tedious manual metadata input. Additionally, the provision of intuitive and appealing tools for the exploration and analysis of other researchers' data publications (see Section 3.2) may further ease the uptake of RDM aspects into researchers' daily routines and ideally also motivate them to publish their own FAIR research data.


