A Guide to Using GitHub for Developing and Versioning Data Standards and Reporting Formats

Data standardization combined with descriptive metadata facilitates data reuse, which is the ultimate goal of the Findable, Accessible, Interoperable, and Reusable (FAIR) principles. Community data or metadata standards are increasingly created through an approach that emphasizes collaboration between various stakeholders. Such an approach requires platforms for collaboration on the development process that centers on sharing information and receiving feedback. Our objective in this study was to conduct a systematic review to identify data standards and reporting formats that use version control for developing data standards and to summarize common practices, particularly in earth and environmental sciences. Out of 108 data standards and reporting formats identified in our review, 32 used GitHub as the version control platform, and no other platforms were used. We found no universally accepted methodology for developing and publishing data standards. Many GitHub repositories did not use key features that could help developers gather user feedback or create and revise standards that build on previous work. We provide guidance for community‐driven standard development and associated documentation on GitHub based on a systematic review of existing practices.

reporting formats promote data reusability (Hart et al., 2016; Pasquetto et al., 2017; Zimmerman, 2008) by enabling efficient integration and interpretation of similarly formatted research data and metadata (Read et al., 2013; US EPA, 2015; Yarmey & Baker, 2013). For convenience, here we use the term "standards" to refer to formal data and metadata standards, reporting formats, and numerous other related terms (Table 1).
Education and outreach with the research community can facilitate both the creation (Sansone et al., 2019) and early adoption of community-developed data standards, as has been successfully demonstrated in the biological sciences (De Pooter et al., 2017; Galdzicki et al., 2014; Wieczorek et al., 2012). Outreach can take many forms, including annual meetings and webinars, engaging diverse disciplines in discussion and testing, and leveraging open web platforms for potential users to preview the standard. Standards may evolve over time as user feedback is generated or new observational methodologies (e.g., sensor technology) necessitate modifications (Bezuidenhout, 2020).
For standards to be widely used by scientists, clear definitions and documentation are essential, and these must describe current and past versions of the standards. Both standards and documentation can evolve with community input, which motivates the need for documenting data standards with systems that support versioning (i.e., detailed tracking of changes to multiple documents from one version to the next) and tracking user input. Such version control systems (VCSs) are typically used for collaborative software development, and there are many web-based hosting services for them (e.g., GitHub, BitBucket, GitLab, CodeCommit, and SourceForge). VCSs are also well suited for chronicling the collaborative development of data standards, which, like software, are oriented around text-based documents (Mergel, 2015; Perkel, 2016; Schneider et al., 2019). A key feature of most VCSs is transparency: direct and suggested changes to content are visible. Note that throughout this study, we use the term "VCS platform" as a shorthand for a modern web-hosted software collaboration environment that combines a VCS with code browsing and editing, issue tracking, documentation, continuous integration, and other tools for enabling software development across teams.
With over 56 million users, GitHub is one of the most popular VCS platforms (GitHub, 2020b). In addition to software development, GitHub is increasingly used as a platform for collaboration on documents, such as standards that require versioning. The software coding community on GitHub has identified four best practices researchers can adopt to increase a GitHub repository's visibility and reusability. The first is to create a descriptive "README" file (Lee et al., 2021), written in Markdown, a lightweight markup language that helps format documents consistently on GitHub. The README serves as the homepage of a GitHub repository and should provide the user with details like "What the project does" and "How users can get started with the project" (GitHub, 2020a). The second is to license the code (or more generally, content) within a GitHub repository to clearly and precisely specify any conditions attached to its use and reuse (Lee et al., 2021; Stoudt et al., 2021). The third is to use GitHub for collaboration by submitting issues (i.e., making a comment on a repository) or pull requests (editing a copy of repository content and then asking the owners to "pull" the changes into existing content; Bissyandé et al., 2013). GitHub repository owners may choose to describe their preferred methods for collaboration by creating a markdown document called CONTRIBUTING.md (Sholler et al., 2019). Finally, GitHub-based developers may create a project webpage using services like GitHub Pages (https://pages.github.com/) that mirror some or all content within a repository to an external project website (Angulo & Aktunc, 2019; Tantisuwankul et al., 2019).
ESS-DIVE (Environmental Systems Science Data Infrastructure for a Virtual Ecosystem) is the Department of Energy's (DOE) data repository for Environmental Systems Science (ESS) research (Varadharajan et al., 2019). Starting in 2019, the ESS-DIVE team partnered with domain experts to develop data reporting formats for a suite of data types ranging from file-level and CSV metadata to domain-specific, such as soil and leaf respiration and hydrological data. While developing the data reporting formats, the domain experts solicited extensive feedback from the communities who would ultimately supply and use the data. Therefore, the documentation system needed the capability to track rounds of community feedback across multiple documents and versions. Thus, we chose GitHub as a natural VCS platform for tracking changes, comments, and issues for the proposed reporting formats. In the process, we found that although version control allows for management and collaboration on standards development, there was a need for guidance on how to best leverage the broader collaborative features of VCS platforms such as GitHub for community-developed standards.
The overall objective of our research was to conduct a systematic review to inform the development of a community-driven approach for describing data standards using a VCS platform, with a focus on GitHub. Specifically, we sought to: (a) characterize the version-controlled documentation for existing data and metadata standards, (b) identify how managers of VCS websites ask users for feedback on standards, and (c) record whether repository managers build user-facing websites (in addition to their VCS site) for hosting version-controlled documents. In this study, we present the results of how 32 groups developing data standards and reporting formats have organized their GitHub repositories, and provide recommendations for structuring VCS for groups taking a community-driven approach to data standard and reporting format development. We also provide a set of guidelines and example templates for managing data and metadata standards and reporting formats on GitHub and other VCS platforms.

Identifying VCS Repositories Used for Managing Data Standards
We conducted a systematic search for groups using VCS to document data standards, data reporting formats, and ontologies (hereafter, data standards; Table 1). In September 2020, we used FAIRsharing.org's data standard search tool (https://fairsharing.org/standards/; Sansone et al., 2019) to locate existing data standards.
We retained only data standards that FAIRsharing.org classified as "ready" rather than "in development." We further filtered the database to select standards associated with the following domains: earth science (n = 65), ecology (n = 34), and environmental science (n = 31). We identified an additional 24 standards that were not captured by our initial search but were recommended by domain experts (data used in this analysis are available in Crystal-Ornelas et al. [2021a]).

Selecting Relevant VCS Repositories
We identified 155 potentially relevant data standards for selection. First, we removed any duplicate data standards that appeared multiple times in our search results (n = 47 duplicates). Then, we retained only standards that use GitHub as the VCS platform for actively managing documentation. We chose GitHub because it was the predominant platform (n = 60 used GitHub out of n = 108 standards reviewed). In fact, it is notable that among all the standards reviewed, only one organization used a different VCS platform (BitBucket). Finally, we excluded GitHub repositories that were used simply for hosting binary or non-text files (e.g., MS Word documents or MS Excel spreadsheets; n = 28 excluded), and thus included only groups that used GitHub for active management and collaboration on data standards (n = 32).
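The selection steps above can be traced as simple arithmetic. The sketch below is illustrative only: the variable names are ours, and the counts are those reported in this section.

```python
# Counts reported in this section; variable names are illustrative.
candidates = 155          # potentially relevant data standards
duplicates = 47           # removed because they appeared more than once
reviewed = candidates - duplicates

on_github = 60            # reviewed standards that used GitHub
file_hosting_only = 28    # repositories hosting only binary or non-text files
included = on_github - file_hosting_only

print(reviewed)   # 108
print(included)   # 32
```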

Characterizing Content Within GitHub Repositories
We visited each GitHub repository identified during our systematic search (Table S1), and characterized the documents and content within each repository according to five general topics: (a) contents of the entire GitHub repository, (b) README page content, (c) preferred methods for collaboration and receiving feedback, (d) labels for tracking issues within a repository, and (e) user-facing project websites.
To characterize the content on each GitHub repository and in README files (Topics 1 and 2), we developed a set of standardized terms for content (e.g., "about section" or "recommended citation"; See Table 2 for a full list of terms and definitions) based on a pilot screening of 10 GitHub repositories. Sometimes during the data collection process, we identified a new term (e.g., "code of conduct") not previously found during the screening process and added it to our list of terms. We analyzed repository-wide content and README content separately to identify if content was more often included as part of a repository's README file (i.e., the repository's homepage) or contained in sub-folders of a repository.
We then broadly reviewed the way that each GitHub repository suggested visitors contribute revisions or updates to their version-controlled documents (Topic 3). To do this, we categorized the preferred method of collaboration for each repository as (a) issue submissions only, (b) both pull requests (i.e., suggesting changes directly to content/documents) and issue submissions, or (c) unclear contribution method. We then conducted a more detailed characterization of content within repositories that supports user collaboration. Similar to the methods used for Topics 1 and 2, we created a set of standardized content terms related to contributing to GitHub repositories (e.g., "issue templates" or "GitHub tutorials"; see Table 3 for a full list of terms and definitions). Then we manually identified whether or not repositories included the content terms.
We used text analysis tools to identify any GitHub issue labels that were commonly used for tracking and organizing user-submitted GitHub issues. We carried out two steps to prepare the text (i.e., "issue labels").

10.1029/2021EA001797
First, we "tokenized" all labels using the Python module re (Python Software Foundation, 2020) to create a ready-to-analyze list of all label text. Then, we "stemmed" each label, which removes suffixes to enable clearer grouping of words with similar stems. For example, the labels "reviewed" and "reviewing" would be stemmed to the root "review." Finally, we counted the frequency of each stemmed label and further grouped stemmed labels by visual inspection where necessary.
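The tokenize, stem, and count steps can be sketched as follows. This is a minimal illustration, not the script used in the study: the labels are hypothetical, and because the stemming tool is not named here, a naive suffix stripper stands in for it.

```python
import re
from collections import Counter

# Hypothetical issue labels; not taken from the reviewed repositories.
labels = ["reviewed", "reviewing", "high-priority", "priority: low", "new-term-request"]

def tokenize(label):
    """Split a label into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", label.lower())

def naive_stem(token):
    """Strip a few common English suffixes (a stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

# Count how often each stem occurs across all labels.
stem_counts = Counter(naive_stem(t) for label in labels for t in tokenize(label))
print(stem_counts["review"], stem_counts["priority"])
```

A production analysis would likely use an established stemmer rather than this suffix list, which handles only the simplest cases.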
Finally, we visited each GitHub repository to determine if repositories had separate project websites for those standards (e.g., https://environmentaldatainitiative.org/ or https://cfconventions.org/) and, if so, identified the service they used to create those websites. We also recorded if the version-controlled documentation was also stored in a long-term data repository (e.g., Zenodo, Dryad, and Figshare).

Results
Our systematic search located 108 data standards, guidelines, and reporting formats as well as ontologies in earth science, environmental science, and ecology. There was variety in the platforms used to manage standards (Table 4). In general, data standards were either hosted using GitHub (n = 60, 55%) or through the organization's website (n = 42, 39%). When the data standard documentation was hosted on an organization's website, the site often provided links to data standard documentation in PDF, RDF, or CSV format. Notably, out of the 108 data standards in our review, only 17 (18%) were published and stored in a recognized data repository.

Version Control Content
Most GitHub repositories (94%) convey general information on the data standard by using an "About" section (Figure 1a). Because all repositories in our review focused on data standards, reporting formats, or ontologies, the repositories often contained a file with descriptions of key terms and definitions of those terms (91%). Another frequently used documentation method was indicating the current version of the data standard using semantic versioning for tracking releases (e.g., v1.0.3; Preston-Werner, 2020). The current version of the data standard was often listed in the body of the repository's README file or by using the built-in "Release" widget (which in turn leverages the underlying Git VCS's "tag" mechanism) that is part of every GitHub repository home page.
Some less common elements of GitHub repositories include usage licenses (56%), recommended citations for the standard (31%), funding information (22%), and "Getting Started" sections (34%). Getting started content is typically different from "About" sections because it provides a table of contents to the GitHub repository complete with links and information on how to quickly make use of documents, folders, or templates within a repository. The patterns we found throughout GitHub repositories are generally mirrored in our analysis that focused on GitHub README files (Figure 1b).

Note. Sample sizes indicate the number of data standards hosted on each of the platforms.

Ways to Contribute to Data Standards
Only 18 (56%) of the repositories in our review encourage contributions, and only 10 (31%) provide detailed instructions on how to suggest changes to the documentation (Figure 2). Seven repositories provided detailed guides for contributing using GitHub recommended CONTRIBUTING.md files within their repository's root folder, or in a ".github" folder. Three repositories provided step-by-step tutorials for making contributions.
Repository managers on GitHub asked users for feedback in several different ways. Of the 32 repositories that were part of our review, 9 (28%) suggest that users submit GitHub issues to make suggestions for revising data standards. It was more common (n = 13; 41%) for repositories to allow both pull requests and issue submissions. For 10 repositories (31%), it was unclear how the data standard developers wanted users to submit feedback.
We found 208 unique labels being used to track and prioritize user-submitted issues across all 32 repositories. The maximum number of labels used by a single repository was 28, while nine repositories did not add any labels beyond the default set provided by GitHub. The most frequently used custom terms included in the labels were "priority" (n = 13), "docs" (n = 8), "class" (n = 7), and "term" (n = 6). These labels were often paired with other words that gave the issue label additional context (e.g., high-priority or new-term-request).

User-Facing Websites
Eleven repositories managed and displayed their data standards only on GitHub (e.g., https://github.com/ESIPFed/science-on-schema.org/). The rest of the standards (n = 21) were hosted on other websites in addition to GitHub. Most often, repository managers used GitHub Pages (https://pages.github.com/) to build a project website (n = 18) to mirror some or all of their documents and templates on a separate site (e.g., https://www.odm2.org/). The remaining three repositories built separate HTML-based sites (e.g., https://environmentaldatainitiative.org/) to display a subset of files hosted on their GitHub repositories.
Figure 1. (a) In our analysis of the content across 32 GitHub repositories we found that greater than 90% of the repositories included both an "about" section to describe the repository and a "terms" list that defines essential elements of the data standard. Relatively few repositories included a table of contents in the form of a "getting started" section (34%). Even fewer provided recommended citations for their work (31%) or funder information (22%). (b) All GitHub repositories in our systematic review contained a README file. Content of the README pages varied, but most contained an "about" section that described the data standard. Licensing information, suggested citations, and versioning details were described in approximately 30% of README pages.

Figure 2. Just over half of the repositories in our review (56%) mention contributing to their data standards on GitHub. Fewer (n = 10) provide details on the multi-step process often involved in reviewing suggested changes to repository content all the way through publishing approved content.

Documenting ESS-DIVE's Data Reporting Formats on GitHub and on ESS-DIVE
We used the practices identified in our systematic review to create our ESS-DIVE Community Space on GitHub (https://github.com/ess-dive-community), where six teams of scientists are developing and managing data and metadata reporting formats (e.g., Bond-Lamberty et al., 2021;Damerow et al., 2021;Ely et al., 2021). We note that initial drafts of documentation were created and reviewed using other collaborative cloud-based tools (e.g., Google Sheets), then migrated to GitHub for community feedback.
To facilitate uploading data standard documentation to GitHub, we created README and GitHub Issue templates, complete with written prompts for content based on the findings of our systematic review (templates are available for download in Crystal-Ornelas et al. [2021b]). For example, all README files include a "How to Contribute" heading where each repository can link to the GitHub issue templates that help organize user feedback. In general, repositories begin with a flat file directory (i.e., with no subfolders) and then folders are created if a reporting format has several files of the same type (e.g., templates, images) within the repository. We use the documentation tool GitBook to render our GitHub repositories as project websites (e.g., https://ess-dive.gitbook.io/continuous-soil-respiration-reporting-format/). Content displayed on GitBook is automatically updated with the most recent version of documents on GitHub, lessening the burden of keeping track of documents across multiple platforms for our repository managers.
When reporting formats are finalized, we use GitHub's "Release" feature to tag updates to the data standard documents with the semantic versioning schema MAJOR.MINOR.PATCH (Figure 3). The version numbers (e.g., v1.0.1) are assigned according to whether the changes to the formats break backward compatibility (MAJOR), remain backward compatible (MINOR), or are typo fixes (PATCH). Once documents are tagged with a version number in GitHub, the documents can be easily downloaded and then archived in ESS-DIVE. The archived data package is issued a DOI and the resulting citation is manually updated in the GitHub repository README file.
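The MAJOR.MINOR.PATCH scheme can be illustrated with a small helper. This is a sketch, not part of the ESS-DIVE tooling: the function name `bump` is ours, and it assumes the Semantic Versioning convention in which MAJOR marks backward-incompatible changes, MINOR marks backward-compatible additions, and PATCH marks small fixes such as typos.

```python
def bump(version, change):
    """Return the next version string for a 'major', 'minor', or 'patch' change."""
    major, minor, patch = (int(part) for part in version.lstrip("v").split("."))
    if change == "major":
        # Backward-incompatible change: reset MINOR and PATCH.
        return f"v{major + 1}.0.0"
    if change == "minor":
        # Backward-compatible addition: reset PATCH.
        return f"v{major}.{minor + 1}.0"
    if change == "patch":
        # Typo or similarly small fix.
        return f"v{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change}")

print(bump("v1.0.1", "patch"))  # v1.0.2
print(bump("v1.0.1", "minor"))  # v1.1.0
print(bump("v1.0.1", "major"))  # v2.0.0
```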

Recommendations for Using GitHub to Develop Data Standards
Community-led data and metadata standard development and adoption require agreement across communities of researchers that work together to discuss, test, and update documentation. Based on the results of our systematic review and in light of well-established GitHub best practices from the coding community described in our introduction, we outline our recommendations for version control of data standard documentation using GitHub (Figure 3). Incorporating these recommendations may improve the usability of community-developed data standards, especially by engaging scientists who are unfamiliar with GitHub yet whose contributions are essential to the ongoing adoption of data standards.

File Types
First, data standard developers need to decide which file types they will upload to GitHub to describe their data standards. The four main file types to choose from are binary files (e.g., Excel spreadsheets or Word documents), CSV files, markdown files, and JSON or YAML files. One consideration when deciding which files to upload is that the GitHub user interface does not allow users to easily view changes (called GitHub diffs) between versions of some of the more human-readable file formats (e.g., Excel spreadsheets or column data such as CSV files). Markdown files have the benefit of being relatively easy to modify within the GitHub user interface, and changes to markdown files can be easily tracked using GitHub diffs. Finally, changes to JSON and YAML files will be shown clearly in GitHub diffs and can be incorporated into GitHub validation tools, but the learning curve for becoming familiar with these formats is steeper than for the other file types.
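To illustrate why text-based formats such as JSON produce readable diffs, the sketch below compares two versions of a term definition using Python's standard difflib module. The term name, fields, and file names are hypothetical, and the output only approximates what a GitHub diff would display.

```python
import difflib
import json

# Hypothetical term definitions for two versions of a reporting format.
old_terms = {"soil_temperature": {"unit": "degC", "required": False}}
new_terms = {"soil_temperature": {"unit": "degC", "required": True}}

def as_lines(terms):
    """Serialize with sorted keys and one field per line so diffs stay small."""
    return json.dumps(terms, indent=2, sort_keys=True).splitlines()

diff = list(difflib.unified_diff(
    as_lines(old_terms), as_lines(new_terms),
    fromfile="terms_old.json", tofile="terms_new.json", lineterm=""))
print("\n".join(diff))
```

Only the changed field appears as a removed and an added line, which is the property that makes line-oriented text formats easy to review.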

README Files
Next, we recommend that GitHub repositories contain, at a minimum, a detailed README file in the repository root, in addition to domain-specific documentation for the data and metadata standards. This README file should include the following subheadings to organize content and support first-time users: "About," "Getting Started," "How to Contribute," "License," "Funding and Acknowledgments," and "Recommended Citation." Without this critical information, it is unclear how and whether data standards are able to be reused by scientists (README file template available in Crystal-Ornelas et al. [2021b]).
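A repository manager could check a README against the recommended subheadings with a short script. The sketch below is one possible approach, assuming the subheadings are written as Markdown "#" headings; the function name and the sample README text are hypothetical.

```python
import re

RECOMMENDED = ["About", "Getting Started", "How to Contribute", "License",
               "Funding and Acknowledgments", "Recommended Citation"]

def missing_sections(readme_text):
    """Return the recommended subheadings not found as Markdown headings."""
    pattern = r"^#{1,6}[ \t]+(.+?)[ \t]*#*[ \t]*$"
    headings = {m.group(1).strip().lower()
                for m in re.finditer(pattern, readme_text, re.MULTILINE)}
    return [name for name in RECOMMENDED if name.lower() not in headings]

# Hypothetical README content with three of the six subheadings.
sample = """# Example Reporting Format
## About
## Getting Started
## License
"""
print(missing_sections(sample))
```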

Licensing
To facilitate collaboration within a repository, we recommend that each repository include an open-source usage license. When first initializing a GitHub repository, users can choose from a set of usage licenses that can be autogenerated as part of the repository set-up process, and then modified to suit user needs. GitHub has also created a website where users can search for and select open-source licenses: https://choosealicense.com/.

Versioning
We recommend semantic versioning (e.g., v1.0.1) be used to track updates to version-controlled data standard documents (Preston-Werner, 2020). By using the built-in GitHub "Release" feature, repository managers can save a snapshot of their GitHub repository at a point in time and assign a version number that aligns with semantic version conventions (Figure 3). Clear version numbers enable users to identify when they need to migrate their data to an updated standard or locate and download previous versions of the documentation. When repository managers choose to publish a "Release," we also recommend that they archive their data standard documents in a long-term data repository. If no domain-specific archive exists, then GitHub's integration with Zenodo can be used to instantly archive GitHub repository content and also generate a recommended citation. Some data standard developers may choose to version terms and vocabularies used within their data standard separately from the standard itself (e.g., https://github.com/tdwg/vocab/blob/master/vms/maintenance-specification.md or https://cfconventions.org/standard_name_rules.html). Decoupling vocabulary and data standard versioning can be an effective way to communicate with users when different aspects of the standard change (e.g., specific vocabulary terms vs. supporting documentation). As community data standards are used and tested and feedback is generated, developers should be prepared for standards to be updated and changed over time. In addition to the semantic versioning described above, we recommend that changes to data standards be documented using one or more of the following approaches: listing the latest updates in the repository's README file, describing changes in local commits, providing pull request descriptions, referencing issue numbers in commit or pull request messages, or creating a GitHub changelog to provide details on data standard updates.

Issues and Contributions
We strongly recommend that collaborative development of community data standards take place through GitHub issues and pull requests rather than by other personal communications so that decisions and revisions are tracked over time and publicly documented within a repository. Repository managers can create issue and pull request templates to help structure user-submitted comments and edits. If data standard developers would like to include detailed contributing guidelines, we suggest creating CONTRIBUTING.md files in the root directory, which will also be indexed by the "community profile" checklist mentioned above, and linked on each issue template (https://docs.github.com/en/communities/setting-up-your-project-for-healthy-contributions/setting-guidelines-for-repository-contributors). We suggest that issues be categorized using either the set of built-in issue labels provided by GitHub or other labels specified by repository creators.

Documentation
We recommend that all GitHub repositories externally display the documentation on more general public-friendly project websites. In our review, we found that three of the repositories that built websites external to GitHub did so using platforms that require manually updating content each time documents are updated. This type of long-term management of multiple documents across multiple web platforms is inefficient, error-prone, and unsustainable. Instead, we recommend using one of the many website-building platforms that seamlessly integrate with GitHub to mirror repository content on webpages and retrieve updates to documents automatically (e.g., GitHub Pages, GitBook, and Netlify). For data standard developers creating machine-readable standards (e.g., CSV, JSON, or YAML files), many of the website-building platforms can display these machine-readable formats in more human-readable tables (e.g., https://github.com/tdwg/camtrap-dp/blob/main/_layouts/tables.html). Website building and updating is just one of the many tasks that can be automated using a feature called "GitHub Actions" (https://docs.github.com/en/actions). Advanced GitHub users may also consider using automated GitHub Actions to welcome contributors to their projects on GitHub or validate contributions from the user community.
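As an example of validating contributions, the sketch below checks that a submitted CSV file contains the columns a reporting format requires. A script of this kind could be run by an automated workflow such as GitHub Actions, although the workflow configuration is not shown. The column names and function name are hypothetical.

```python
import csv
import io

# Hypothetical required column names for a reporting format.
REQUIRED_COLUMNS = ("site_id", "timestamp", "soil_temperature")

def missing_columns(csv_text, required=REQUIRED_COLUMNS):
    """Return required column names absent from the CSV header row."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]

# Hypothetical submission that lacks one required column.
submission = "site_id,timestamp,air_temperature\nA1,2021-01-01T00:00:00Z,3.2\n"
print(missing_columns(submission))
```

In an automated check, a non-empty result would typically cause the run to fail so that the contributor sees which columns need to be added.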

Challenges and Future Directions
There are three main challenges to using GitHub as the primary platform for collaboration on data and metadata standard documentation. First, by design Git, and thus GitHub, does not support real-time collaboration of the kind offered by cloud-based files (e.g., Google Sheets). For the six teams developing data reporting formats with ESS-DIVE, the initial documents were generally drafted in word processing or spreadsheet tools before being uploaded to GitHub. The benefit of this approach is that many contributors, some unfamiliar with Git and GitHub, could directly edit documents and suggest changes. However, it means that the earliest phases of data standard development occurred outside of the VCS platform. One solution is to save the collaborative spreadsheet as a CSV file and, when updates are made, upload the CSV to the GitHub repository, tag a new release of the data standard on GitHub, and close the related GitHub issue through a commit message.
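GitHub closes an issue when a commit message that reaches the default branch pairs a closing keyword (such as "closes," "fixes," or "resolves") with the issue number. The sketch below shows a simplified way to find such references in a message; the regular expression covers only the basic "keyword #number" form, and the message and issue numbers are hypothetical.

```python
import re

# Simplified pattern for closing keywords followed by an issue number.
CLOSING = re.compile(r"\b(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)\s+#(\d+)",
                     re.IGNORECASE)

def issues_closed_by(message):
    """Return issue numbers that the commit message asks to close."""
    return [int(number) for number in CLOSING.findall(message)]

# Hypothetical commit message for an updated CSV export.
message = "Update terms CSV from shared spreadsheet; closes #12"
print(issues_closed_by(message))
```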
A second, albeit relatively minor, limitation is that file types that are commonly used to create data standards (e.g., Excel spreadsheets, Word documents, and even CSV files) are not easy to edit within the GitHub user interface, either because they are proprietary binary formats (Excel/Word) or columnar by nature (CSV). Computer code and markdown files are, in comparison, easy to edit within GitHub and produce human-readable GitHub diffs. However, we note that if computer code or markdown files have hundreds of lines of changes between versions, users may want to view GitHub diffs using the desktop (https://desktop.github.com/) rather than the website version of GitHub. Binary files like Word documents or spreadsheets must be edited offline, and then updated within a GitHub repository. This may deter contributions from users who want to view and edit documents in one location.
The third and perhaps most important challenge is related to the sometimes steep learning curve that must be overcome for scientists to feel comfortable and/or motivated to engage with content on GitHub (Isomöttönen & Cochez, 2014). Although GitHub and other organizations have developed educational tutorials geared toward first-time users (e.g., https://lab.github.com/ or Openscapes, 2021), VCS platforms in general, and GitHub specifically, remain focused on a programming user base. Thus, despite new user-friendly improvements, the GitHub learning curve can be steep for some non-computational researchers. Features like "GitHub Conversations" and inline commenting are examples where changes to GitHub's user interface can make somewhat complicated tasks (i.e., reviewing pull requests) more approachable. Moreover, creating project websites (i.e., using platforms like GitBook) lets users unfamiliar with GitHub interact with documents through more human-readable websites. However, a key feature of version control is change logs that let visitors see how version-controlled content changes over time, and these change logs are not exposed through the GitBook web interface.
Our focus on community engagement in standards development means that our systematic review did not explicitly consider the machine-readability of data and metadata standards documented on GitHub.
Although machine-readability is the ultimate target, intermediate human-readable standards may be required to lower the barrier for standards adoption when developing meaningful formats and guidelines for metadata and data. Machine-readable standards are a cornerstone of FAIR data (Wilkinson et al., 2016), and multidisciplinary teams of domain scientists, informaticists, and computer programmers are critical to bridging the gap between human and machine-readable standards. Indeed, there are templates used by some data standard developers to render machine-readable standards into human-readable templates (e.g., https://github.com/tdwg/camtrap-dp/blob/main/_layouts/tables.html rendered as https://tdwg.github.io/camtrap-dp/data/#deployments). As we move toward engaging the broader research community in adopting community-developed data standards, we see GitHub and autogenerated project websites as the platforms for responding to user feedback, posting tutorials for using standards, and versioning our supporting documents in response to the user community.

Conclusion
Community-developed data standards and reporting formats are a key step toward making data FAIR. VCS platforms can enable collaboration on documentation during and after the standards development process. However, our systematic review found that GitHub, and more broadly VCS platforms, are generally underused for collaboration on data and metadata standard development (30% of all standards were actively managed on GitHub, and only one standard used BitBucket, out of 108 reviewed). Even among the GitHub repositories, many do not use important tools for collaboration, such as issue templates, issue labels, licensing information, and project websites autogenerated from GitHub content, that can enable community discussion and feedback for improving the standards. At ESS-DIVE, we have used GitHub to enhance the development of our community data and metadata reporting formats, using the systematic review described in this paper to guide the structure and content of the ESS-DIVE Community Space on GitHub. The recommendations on VCS structure we outline here can be used by researchers who are developing data standards or reporting formats and looking for greater community involvement in data stewardship.