Establishing an early indicator for data sharing and reuse

Funders, publishers, scholarly societies, universities, and other stakeholders need to be able to track the impact of programs and policies designed to advance data sharing and reuse. With the launch of the NIH data management and sharing policy in 2023, establishing a pre‐policy baseline of sharing and reuse activity is critical for the biological and biomedical community. Toward this goal, we tested the utility of mentions of research resources, databases, and repositories (RDRs) as a proxy measurement of data sharing and reuse. We captured and processed text from Methods sections of open access biological and biomedical research articles published in 2020 and 2021 and made available in PubMed Central. We used natural language processing to identify text strings to measure RDR mentions. In this article, we demonstrate our methodology, provide normalized baseline data sharing and reuse activity in this community, and highlight actions authors and publishers can take to encourage data sharing and reuse practices.


INTRODUCTION
Over the last 20 years, there has been a growing recognition of the benefits of sharing and reuse of research data: enhancing research transparency, supporting rigour and reproducibility, promoting innovation, and maximizing the economic return on investment of research funding (Vasilevsky et al., 2013; Beagrie & Houghton, 2014; Menke et al., 2022; Starr et al., 2015). Most researchers want to share and reuse data but do not have the time, resources, motivation, or know-how to do so (Hahnel et al., 2020), and rates of data sharing and reuse vary widely among researchers in the biological and biomedical sciences (Park, 2022). The NIH Data Management and Sharing policy (National Institutes of Health, 2020) furthers the requirements for the adoption of data sharing and reuse by the biological and biomedical research community.
Scholarly societies play a vital role in promoting and enabling data sharing and reuse among researchers (Maienschein et al., 2018; Ruediger et al., 2022). The Federation of American Societies for Experimental Biology (FASEB) recently launched DataWorks! (FASEB, 2021), a suite of programs designed to promote, enable, and reward a culture of data sharing and reuse across the biological and biomedical sciences. In 2022, FASEB publications similarly started to require authors to provide data availability statements and to require data citation as an initial step on the path to encouraging data sharing and reuse. To assess the impact of these programs, FASEB identified a need to establish a baseline and to monitor changes in data sharing and reuse over time.
A major structural challenge has been how to measure such adoption of data sharing and reuse practice. One option that has been reported separately is to examine data availability statements. During the time period of our study, about 20% of biomedical preprints and published works included such a statement, and very few described openly available data (McGuinness & Sheppard, 2021).
Another option that has also been explored is to examine citation of data in the reference list of research articles (Parsons et al., 2019). Authors may cite data they collected or data they obtained from another source and reused. Data citation standards have been developed and there has been a concerted attempt to align standards and policies (Altman & Borgman, 2015; Cousijn et al., 2019; Data citation principles, 2016; Hrynaszkiewicz et al., 2020). For example, researchers may deposit their data sets into a repository and obtain a unique identifier (DOI) to enable citation and discovery. DataCite Event Data can be used to track citation of those data sets (DataCite, 2022). However, while data citation infrastructure exists, the adoption of data citation practices is just emerging in the life sciences (Robinson-García et al., 2016). Researchers are starting to deposit their data in repositories, and the implementation of citation practices by publishers is only just emerging (Cousijn et al., 2018). While we would have liked to measure data sharing and reuse using DataCite Event Data to track data citations, either directly or through a service such as Scholix (Burton et al., 2017), this approach is not presently feasible (Khan et al., 2020). Illustrating this lag, an August 2022 query using the DataCite Event Data API showed that there were 5,854 DOIs registered in 2020 with DataCite of type 'data set' with at least one citation, the majority of which were registered post-publication by repositories including disciplinary preprint servers and university repositories showcasing faculty works (not by publishers). By comparison, the entire 2020 DataCite Event set had over a million citations, over 95% of which were associated with a single repository. The recent launch of the Open Global Data Citation Corpus by DataCite and partners, which will include DOI and non-DOI data citations, will go a long way toward addressing these issues (Vierkant, 2023).
We therefore decided to test an alternative early indicator of data sharing and reuse that could be used to establish baselines while a more formal citation infrastructure is being adopted. Authors mention research resources, databases, and repositories (RDRs) in the Methods section of journal articles (Park et al., 2016), and there has been some work to track data sharing and reuse practices using a combination of both formal citations and informal references to data within the text of a publication (Park & Wolfram, 2017). RDRs are collated and curated data outputs from many research studies, and include bibliographic databases like Cochrane Library and PsycINFO; reagent databases like ATCC and AddGene; research databases like Ensembl and Pfam; and research software databases and repositories like Cytoscape and MaxQuant. Our hypothesis is that, if we cannot yet measure citation of an individual data set, we may be able to start to understand the potential of data citation infrastructure by measuring RDR citations in their stead.
We describe an approach to measuring biomedical data sharing and reuse that uses tools to mine free text for RDR mentions combined with the SciCrunch database of biological and biomedical research resources used and continually developed by the RRID project (Bandrowski et al., 2015). We present the methodology and descriptive statistics, and discuss the utility and limitations of the approach for assessing the volume of RDR mentions overall as well as more granular measures by resource type, journal, or discipline.

METHODS
After determining that journal article reference lists are not yet a feasible source for data citations, we decided to focus on text analysis of Methods sections. While authors may list research resources in other sections of a paper, we focused on Methods to reduce the possibility of a false positive if an author were to mention a resource that is not used in the context of the research reported. We obtained Methods text for biological and biomedical journal articles from articles indexed in PubMed and available in the PubMed Central Open Access subset (NLM, 2022) for the years 2020 and 2021. For the purposes of this study, mineable text depends on both the licence of the publication and whether its journal uses a standard markup language (JATS, the Journal Article Tag Suite) so that sections of the publication are marked and thereby easily queried (see, e.g., Mietchen, 2015). According to EuropePMC, in 2020 there were 1,638,399 articles published, of which 625,338 (38%) had a Methods section available to text mine ('free to read and use'). We identified a discrete subset of SciCrunch RDRs to include in this project. We reviewed the top 1,000 entries in the SciCrunch database, measured by citations; removed entries for organizations (such as universities without a corresponding RDR) or non-relevant tools (such as reference managers); updated links; and consolidated duplicates resulting from RDR mergers and name variations. The resulting list of 737 RDRs is shown in Table S1.
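To make the JATS-based selection step concrete, the following is a minimal sketch of extracting a Methods section from article XML. It is not the authors' production pipeline: the function name, the sample XML, and the simple heading heuristic (matching "method" in the `sec-type` attribute or section title) are illustrative assumptions, and a real harvest would fetch full-text JATS from the PubMed Central OA subset.

```python
import xml.etree.ElementTree as ET

def extract_methods_text(jats_xml: str) -> str:
    """Return the concatenated text of JATS <sec> elements that look like Methods.

    Heuristic: a section qualifies if 'method' appears in its sec-type
    attribute or its <title> text (case-insensitive).
    """
    root = ET.fromstring(jats_xml)
    parts = []
    for sec in root.iter("sec"):
        sec_type = (sec.get("sec-type") or "").lower()
        title_el = sec.find("title")
        title = (title_el.text or "") if title_el is not None else ""
        if "method" in sec_type or "method" in title.lower():
            # itertext() flattens nested paragraphs, links, etc. into plain text
            parts.append(" ".join(sec.itertext()))
    return " ".join(parts)

# Hypothetical, heavily abridged JATS fragment for illustration only.
SAMPLE = """<article><body>
<sec sec-type="intro"><title>Introduction</title><p>Background.</p></sec>
<sec sec-type="methods"><title>Methods</title>
  <p>Sequences were annotated using Ensembl.</p></sec>
</body></article>"""

methods_text = extract_methods_text(SAMPLE)
```

Restricting extraction to sections marked up this way is what makes the 38% "mineable" figure meaningful: articles without JATS-tagged Methods sections simply do not enter the corpus.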
We used harvesting processes to extract RDR mentions based on the RRID initiative methodology (Bandrowski et al., 2015). We also harvested mentions of the URL or name of an RDR listed in the SciCrunch database, as described in Ozyurt et al. (2016). This data set was augmented by articles in PubMed Central but not the OA subset in which RRIDs were entered by authors during the journal publication process. To ensure the integrity of the harvested data, we performed statistical tests to determine whether the RRID citations were consistent with algorithm-found citations. We manually viewed and removed inaccurate outliers, then statistically adjusted the rate of use.
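The core of the harvest can be pictured as matching a curated catalogue of RDR names and URLs against Methods text and emitting one record-pair per RDR per article. This is a simplified sketch, not the RRID pipeline itself: the two-entry catalogue, the RRID values, and the 40-character snippet window are placeholder assumptions (the real list is the 737 curated SciCrunch entries in Table S1).

```python
import re

# Hypothetical mini-catalogue mapping an RRID to known name/URL variants.
# The RRID values below are placeholders, not verified SciCrunch identifiers.
RDR_CATALOGUE = {
    "RRID:SCR_000001": ["Ensembl", "ensembl.org"],
    "RRID:SCR_000002": ["AddGene", "addgene.org"],
}

def harvest_mentions(pmid: str, methods_text: str):
    """Return (RRID, PMID, snippet) record-pairs found in one article's Methods."""
    records = []
    for rrid, variants in RDR_CATALOGUE.items():
        for variant in variants:
            # Lookarounds avoid matching inside longer words.
            m = re.search(r"(?<!\w)" + re.escape(variant) + r"(?!\w)",
                          methods_text, re.IGNORECASE)
            if m:
                start = max(0, m.start() - 40)
                snippet = methods_text[start:m.end() + 40].strip()
                records.append((rrid, pmid, snippet))
                break  # at most one record-pair per RDR per article
    return records

recs = harvest_mentions(
    "12345678",
    "Plasmids were obtained from AddGene and genes were annotated with Ensembl.",
)
```

Deduplicating to one record-pair per RDR per article (the `break`) mirrors how the study counts unique RDR-article associations rather than raw string hits.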

RESULTS
From the mined Methods text, we extracted RDR mentions and created a unique association between an RDR (represented by an RRID number) and an article where the repository was mentioned (PMID number). For each pair we built a record (Table 1). The data set of all records is available in Table S2.

RDR mentions
We performed a descriptive analysis of the RDRs mentioned to better understand whether there are specific journals, research fields, or RDR types that are more frequently mentioned. Overall, the distribution of RDR mentions is a long-tail type of distribution: most articles refer to a relatively small group of RDRs, while most RDRs are mentioned relatively infrequently (Fig. 1).
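The top-N coverage figures reported below (e.g., the 10 most-mentioned RDRs covering over half of all mentions) follow directly from record-pair counts. A minimal sketch, using toy data rather than the study's record-pairs:

```python
from collections import Counter

def top_n_coverage(record_pairs, n: int) -> float:
    """Fraction of all record-pairs accounted for by the n most-mentioned RDRs."""
    counts = Counter(rrid for rrid, _pmid in record_pairs)
    total = sum(counts.values())
    top = sum(c for _rrid, c in counts.most_common(n))
    return top / total

# Toy data with a long-tail shape: RDR 'A' dominates, the rest trail off.
pairs = [("A", i) for i in range(8)] + [("B", 1), ("B", 2), ("C", 3), ("D", 4)]
coverage = top_n_coverage(pairs, 1)  # 8 of 12 record-pairs
```

Applied to Table S2, the same calculation yields the half-of-all-mentions and 65% figures for the top 10 and top 20 RDRs.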
The most frequently mentioned RDRs are shown in Table 2, together with the discipline, RDR type, number of record-pairs, and list rank for 2020 and 2021.The ten most-mentioned RDRs covered over half of all mentions, and the 20 most-mentioned covered 65% of all mentions.
Of the top 20 mentioned RDRs, nearly half were specialized for genomics and a quarter each for clinical research and proteomics (Fig. 2, top). RDR mentions were also clustered by type, distributed fairly equally among research databases, bibliographic databases, research software and repository resources, reagent resources, and research repositories (Fig. 2, bottom).

Journals with RDR mentions
We analysed RDR mentions from a journal perspective by calculating the number of articles published in each journal that mention at least one RDR. This approach yielded record-pairs from 3,312 journals. The distribution of journals with RDR mentions is also a long-tail type, meaning that most mentions come from a relatively small number of journals, while most journals refer only to a few RDRs. Reviewing total RDR mentions, the top 20 journals in the data set covered 32% of all record-pairs and the top 200 journals covered 71% of all record-pairs. These data can be skewed by journals publishing a large number of articles.
We then normalized RDR mentions by journal output and other variables to explore which journals have the highest rates of RDR mentions; the results are shown in Table 3. All source data can be found in Table S2.
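Normalizing by journal output amounts to dividing each journal's count of mentioning articles by its count of mineable articles. A minimal sketch with hypothetical inputs (the function name and toy data are illustrative, not the study's actual normalization code, which also adjusts for other variables):

```python
def journal_mention_rates(article_journal, mentioning_articles):
    """Percent of each journal's mineable articles with at least one RDR mention.

    article_journal: dict mapping PMID -> journal name (all mineable articles)
    mentioning_articles: set of PMIDs with at least one harvested record-pair
    """
    totals, hits = {}, {}
    for pmid, journal in article_journal.items():
        totals[journal] = totals.get(journal, 0) + 1
        if pmid in mentioning_articles:
            hits[journal] = hits.get(journal, 0) + 1
    return {j: 100.0 * hits.get(j, 0) / t for j, t in totals.items()}

# Toy example: journal A has 4 mineable articles (3 with a mention);
# journal B has 2 mineable articles (none with a mention).
articles = {"p1": "A", "p2": "A", "p3": "A", "p4": "A", "p5": "B", "p6": "B"}
rates = journal_mention_rates(articles, {"p1", "p2", "p3"})
```

Dividing by mineable output rather than raw article counts is what prevents large journals from dominating Table 3 simply by volume.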

DISCUSSION
Our results show that mining Methods text of journal articles for RDR mentions is not only feasible but also provides useful information that can help the community encourage and measure early-stage adoption of data sharing and reuse practices. While data sharing and reuse are not universally adopted, we show the practice is further along across the broad biological and biomedical sciences literature than DataCite or citation practices might indicate. Using this methodology, we show that authors are already engaged in using RDRs, and we quantify this activity by RDR type and research area.
If researchers are provided more information about how to share and reuse data, as well as more workflows to capture data mentions, we can expect more authors to mention data and RDRs in their articles.
We describe a methodology for an early indicator that can be used until data citation practices are more widely adopted in the biological and biomedical community, adoption that would enable practical application of tools such as Scholix. Measuring the impact of interventions is essential: FASEB DataWorks! community engagement, journal author guidance, and funder policies are all necessary components of the goal of increasing research data sharing and reuse practices.
Each record consists of:
• RRID of the RDR and name of the RDR ('record-pair').
• PubMed Identifier (PMID) of the article, title of the publication, DOI, date of publication, and the snippet (the relevant portion of the author's sentence describing the repository).
• Title of the journal, journal ID, journal ISSN, and/or journal ESSN.
The resulting 2020 data set consists of 95,430 unique record-pairs; 66,187 unique articles; and 616 unique RDRs. The 2021 data set consists of 110,048 unique record-pairs; 75,532 unique articles; and 619 unique RDRs (Table 1).

TABLE 1
Article data set available to mine and research resources, databases, and repositories (RDR) mentions harvested.
FIGURE 1 Unique research resources, databases, and repositories (RDR) mentions in 2020, shown as a percent of record-pairs by range of total mentions. The number above each column represents the number of repositories in each range.

FIGURE 2 Distribution of research resources, databases, and repositories (RDRs) in the top 20 mentions by research area (top) and type (bottom). Research databases: databases containing aggregated structured research data; for example, databases of ontological (GO annotations) or positional gene annotations (ClinVar), raw microarray data files (GEO), microscopic images (Allen Brain Atlas), and facts pulled from published work such as affinity values for drug target interactions (UniProt). Research repositories: archives for research data files that are usually supplementary to a paper or other scholarly work, such as Figshare, Dryad, or Mendeley Data. Bibliographic databases: databases containing primarily scientific articles, research reviews, preprints, and legal documents such as patents, standards, and other long text documents; for example, PubMed, Scopus, Google Scholar. Research software repository resources: databases that are registries or repositories of compiled software, software code, coding tools, research analysis tools, software versioning systems, and code archiving tools, such as GitHub, PyPI, or Elixir's bio.tools. Reagent resources: databases that serve primarily information about wet-lab and consumable research resources, such as Cellosaurus, ATCC, or the Antibody Registry.

www.learned-publishing.org © 2023 The Authors. Learned Publishing published by John Wiley & Sons Ltd on behalf of ALPSP. Learned Publishing 2024; 37: 22-29

TABLE 3
Journals with the greatest percentage of mineable articles having at least one research resources, databases, and repositories (RDR) mention, 2020.