OpenMindat: Open and FAIR mineralogy data from the Mindat database

The open data movement has brought revolutionary changes to the field of mineralogy. With a growing number of datasets made available through community efforts, researchers are now able to explore new scientific topics such as mineral ecology, mineral evolution and new classification systems. Recent results have shown that open data, coupled with data science skills and expertise in mineralogy, can lead to impressive new scientific discoveries. Yet, feedback from researchers also reflects the need for better FAIRness of open data, that is, data that are findable, accessible, interoperable and reusable for both humans and machines. In this paper, we present our recent work on building the open data service of Mindat, one of the largest mineral databases in the world. In the past years, Mindat has supported numerous scientific studies, but a machine interface for data access has never been established. Through the OpenMindat project we have achieved solid progress on two activities: (1) cleansing data and improving data quality, and (2) building a data sharing platform and establishing a machine interface for data query and access. We hope OpenMindat will help address the increasing need among researchers in mineralogy for an internationally recognized authoritative database that is fully compliant with the FAIR guiding principles and helps accelerate scientific discoveries.


| INTRODUCTION
Data science is changing the style of research in geosciences. Workflow platforms, data cleansing, statistics, machine learning, data mining and data visualization are now common in the everyday work of geoscientists. Exciting data-driven scientific discoveries have been increasingly made in mineralogy (Hazen et al., 2019), geochemistry (Keller et al., 2015), geophysics (Bergen et al., 2019), palaeobiology (Fan et al., 2020), palaeoclimatology (Ma, Hinnov, et al., 2022), plate tectonics (Merdith et al., 2021) and more. Geoscientists and data scientists have seen the big opportunities made available by building, sharing and exploring the 'big Earth data'. In recent years, many initiatives and programs were built to accelerate that process, such as the EarthCube program initiated by the National Science Foundation (NSF) in the United States (Richard et al., 2014), the Deep-time Data Driven Discovery (4D) Initiative started by the Carnegie Institution for Science (4D Initiative, 2019) and the Deeptime Digital Earth (DDE) big science program started by the International Union of Geological Sciences (IUGS) (Wang et al., 2021). Practitioners in the field of geoinformatics have also summarized methodologies and approaches, and shared their best practices, such as abductive analysis (Hazen, 2014), data exploration (Ma et al., 2017), knowledge-rich intelligent systems (Gil et al., 2019) and context-aware machine learning in geosciences (Reichstein et al., 2019), just to name a few.
Among the active works of data science for geoscience, mineral informatics (Prabhu et al., 2022) has emerged as a distinct topic. It incorporates many of the methods and best practices mentioned above and highlights the advantages of shared open data, multidisciplinary working teams and iterative datathon activities. In recent years, the work of mineral informatics has proven that many scientific discoveries can be made when the data are in place and coupled with advanced analytical techniques and data science expertise. The research outputs on mineral evolution and ecology (Cleland et al., 2021; Hazen et al., 2019; Hystad et al., 2015) are good examples illustrating the key components and steps in the methodology of mineral informatics. Nevertheless, the practitioners, including the authors of this paper, also found that accessing adequate reliable open data often remains a challenge for mineral informatics (Ma, 2022). Although researchers are gradually improving the FAIRness (Findable, Accessible, Interoperable and Reusable) (Wilkinson et al., 2016) of open data in the field of mineralogy through community activities (Golden et al., 2019; Prabhu et al., 2021; Wyborn et al., 2021), more work still needs to be done to fully meet the needs of mineral informatics, particularly for machine-to-machine interactions (Musen, 2022; Wyborn & Brownlee, 2022).
The purpose of this paper is to present our design and development of OpenMindat, a project that aims to build a machine access interface for the Mindat database. Mindat, one of the largest mineral databases in the world, has been widely used by many geoscientists, especially those in the field of mineralogy, but a machine interface for bulk data download has never been fully established. Through a research project funded by NSF, the authors have designed the technical structure for enabling machine readability of Mindat, established the open data service, and are now working on the technical extension. We hope OpenMindat will help meet the data-intensive research needs of many geoscientists. The remainder of the paper is organized as follows. Section 2 gives a brief introduction to the background of Mindat and the need for better machine access to it. Section 3 presents the designed structure of OpenMindat, the established machine interface for the open data service and the progress of other technical developments. Section 4 discusses the experience and best practices from OpenMindat and the work on mineral informatics, and it also lists a few items for future work. Finally, Section 5 concludes the paper.

| STATUS OF MINDAT
Mindat was started as a private database of minerals and their localities around the world by Jolyon Ralph in December 1993. In October 2000, its website (www.mindat.org) was launched as a free-to-use resource targeted primarily at mineral collectors and amateur mineralogists, based around a 'crowdsourcing' model where verified members can contribute new data for the database. Unlike the Wikipedia system where anonymous edits are possible, all contributions to Mindat must be approved by registered users and are peer-reviewed by regional experts. As of July 23, 2022, Mindat had 64,726 registered users, 6,865 of whom had the 'data contributor' right. A management team of 53 mineralogical experts from around the world was selected to oversee the review process and to manage the direction of the site. Mindat is by far the most widely used mineral database in the world. In 2021 alone, it received 44,333,302 page views from 14,506,574 sessions of 10,148,136 unique visitors. Those website visitors are from 246 countries and territories, and about 32% of usage is from the United States. More impressively, Mindat has underpinned a large number of scientific discoveries in mineralogy, petrology, economic geology, geochemistry, planetary science and other related disciplines. As evidence, the number of scientific publications citing Mindat has greatly increased in recent years (Figure 1).
In contrast to those significant scientific merits and social impacts, the infrastructure development and maintenance of Mindat is struggling to keep pace with the overwhelming data needs, to take advantage of new technologies and to scale its base physical infrastructure. Similar to its 'crowdsourcing' data, the infrastructure development and maintenance of Mindat rely entirely on 'crowdsourcing' donations and sponsorships. Although all the data on the Mindat website are free for users to browse through a well-organized GUI (Graphical User Interface), the machine interface for data access and download has never been fully established. An experimental simple API (Application Programming Interface) has been running on the Mindat web server for nearly a decade. However, it has not been made public because of the risk of overloading the server and bringing the website down for all other users. In the past decade, Jolyon Ralph has received numerous requests for sharing Mindat data. Very often, what Ralph could do was to design specific queries through the experimental API to download a part of the Mindat data, and then share it with the data requester. This case-by-case data sharing mechanism is tedious while the data sharing needs are continuously increasing: clearly, this method will not scale. In worst-case scenarios, copies of Mindat have been downloaded and then republished as new vocabularies online without recognition of Mindat, and sometimes with new Uniform Resource Identifiers given to the same terms. Not only is this practice unethical, but the mineral names/definitions may also be changed, and hence a growing number of online mineral name vocabularies create confusion among users as to which is the most authoritative mineral database and which ones are likely to be sustained over the long term. To better serve both scientific and social needs, it is time to develop a fully open access and interoperable architecture for Mindat (i.e. OpenMindat), and make it an active node in the geoscience cyberinfrastructure ecosystem that is recognized as an international authoritative resource.

| Major data components and work plans to implement FAIR principles
As of July 2022, the Mindat website hosted more than 18 million pages and used about 25.8 TB of disk space (Ralph, Martynov, et al., 2022). Mineral species, locality descriptions, mineral occurrence information and photographs of specimens are the most highlighted records available on the site. Each mineral occurrence is a single named mineral species reported from a single named locality (Figure 2). Mindat provides detailed attributes for each mineral species, such as those subjects listed in the right part of Figure 2. In recent years, Mindat has also set up interfaces with many other open geoscience data resources and imported records about other data subjects, such as rocks and fossils. On the GUI of the Mindat website, geological map services from other data portals such as Macrostrat (Peters et al., 2018) can also be loaded to illustrate the geological environment of localities and mineral occurrences. The geological age of mineral occurrences has recently drawn attention in Mindat as it is an essential attribute for the study of mineral evolution as well as the co-evolution of geosphere and biosphere (Golden et al., 2019; Hazen & Ferry, 2010). From a broad perspective, we can map locality and geological age into the classes Spatial Object and Temporal Entity in GeoSPARQL (Battle & Kolas, 2011) and the Time Ontology (Cox & Little, 2020), respectively (see top part of Figure 2). The locality records show diverse patterns, such as a literal address written in a hierarchical form, coordinates of a point, coordinates of a region and more. Similarly, the age records can also appear in literal and/or numeric forms. In Mindat, significant efforts have already been spent to cleanse and reconcile the locality and age records into standardized forms. The comparison to community-level ontologies and data models such as GeoSPARQL and the Time Ontology has provided 'food for thought' for us to address further topics of data interoperability in Mindat and work towards the FAIR principles.
Compared with Figure 2, the detailed records in Table 1 give a more comprehensive view of the data components in Mindat as well as their properties, data sources, community standards, and progress and concerns in data curation. The work on OpenMindat also provides an opportunity for us to take stock of the current records in Mindat and draw plans for cleansing and extension. With the current Mindat as the foundation, OpenMindat will connect to a range of authoritative geoscience databases, collect properties to enrich the specification of mineral data, leverage community-level standards and crowdsourcing efforts to cleanse the records, and share data with other databases and the broad community of users.
The designed structure of OpenMindat is shown in Figure 3. Besides leveraging existing data resources in the geoscience community, OpenMindat will also incorporate several state-of-the-art theories, platforms and technologies in the development. FAIR principles (Wilkinson et al., 2016) will be used to guide the data curation and service construction to increase the findability, accessibility, interoperability and reusability of data for both humans and machines. In the detailed technical development, Schema.org (Noy et al., 2019) will be used in the OpenMindat architecture to increase findability through the EarthCube GeoCODES and the Google Dataset Search, which will help increase the visibility of OpenMindat data to a much broader community and make the data easier to integrate with other open geoscience data. The database structure and nomenclature will adopt disciplinary standards such as those ratified by the International Mineralogical Association (IMA) and IUGS, to help link OpenMindat to the Open Knowledge Network (Guha & Moore, 2016). A Service-Oriented Architecture (Fielding, 2000) will be applied to support the machine interface for data access. Packages in Python and R will be developed to enable data query and access directly from workflow platforms, such as Jupyter and RMarkdown, to facilitate reproducible workflows and open science. Several cloud-based facilities, such as PanGeo (Arendt et al., 2018) and Geoweaver (Sun et al., 2020), will be reused to tackle big data processing in workflows. We will also leverage the popular Mindat website to publish updates about OpenMindat and facilitate data science activities among the geoscience community.

| Progress of technical development and results
Following the structure and roadmap planned in Figure 3, we have made solid progress on two major topics: (1) Cleanse data and improve data curation and (2) Build a data-hosting platform and establish the machine interface. The subsections below will give details about the development activities and the current outputs.

1. Cleanse data and improve data curation.
As a large crowdsourcing database, Mindat has inherent biases within the data due to the sources of data and the areas of interest of people willing to contribute (Geldmann et al., 2016; Kosmala et al., 2016; Ralph, Martynov, et al., 2022). For example, the most common rock-forming minerals are vastly under-represented. The feldspar group represents more than 50% of crustal volume, yet they are not among the most reported minerals on Mindat. Amusingly, the relatively rare mineral gold is among the very most cited minerals, because every 'gold mine' must have the mineral gold. The same is not true for most other commodities: most copper mines do not have the mineral copper, for instance. Other biases are simply due to the lack of available information on many geographic areas worldwide. Fortunately, these biases are well understood and can be dealt with by statistical models (Gonsamo & D'Odorico, 2014; Robinson et al., 2018; Xue et al., 2016). We have conducted a thorough review detailing the known biases and areas of doubt within the Mindat database and have used that as a guide to cleanse the records before releasing them on the API.

FIGURE 2 Major data components in Mindat, with more details on mineral species, mineral occurrence and locality.

TABLE 1 Key data subjects in OpenMindat and their connections to the geoscience cyberinfrastructure ecosystem (Records collected in July 2022).

To facilitate ethical data reuse, we have planned a framework of data access licences for the different data types in Mindat to support OpenMindat. The Creative Commons License (i.e. CC-BY-NC-SA 4.0) is a popular choice for licensing open data. However, as part of this procedure, we have discussed the impact of such a licence with our existing partners and source databases. We are creating a forked copy of the CC-BY-NC-SA 4.0 licence tailored specifically for these needs. Such a licence will allow free-of-charge access and use of the OpenMindat data for non-commercial purposes with as few restrictions as possible. The share-alike directive in the proposed licence ensures that any database derived significantly from OpenMindat would also need to be licensed under a similar permissive licence. This protocol will encourage the growth of more open data and open science projects (cf. Rule 2 in Cox et al., 2021). For example, the information about a locality, its coordinates, recorded mineral list and reference list would fall under this new licence. If a locality contains a GeoJSON polygon that has been taken from OpenStreetMap, this falls under OpenStreetMap's ODbL licence (which allows us to redistribute their data with attribution). In comparison, some other data types in Mindat should use different licences. For example, a photograph of a mineral specimen remains the copyright of the photographer, and OpenMindat does not have any right to distribute this image without permission.
2. Build a data-hosting platform and establish the machine interface.
In order to provide a robust data hosting platform for data access and downloads, OpenMindat has a new server physically separate from the existing Mindat web server (Ma, Ralph, et al., 2022; Ralph, Ma, et al., 2022). The hosting platform is an Infra-2 Infrastructure Server from OVHCloud with 64 GB of RAM, 3 × 3.84 TB SSD disks in Soft RAID configuration and 1 Gbps unmetered public bandwidth. Behind the current Mindat website are 20 years of experience in running a highly popular data-intensive website from dedicated servers. We have collaborated closely with the Mindat server provider to build the platform for OpenMindat, to ensure fast server-to-server connectivity for keeping OpenMindat up to date with the primary Mindat database. The Mindat website is built on PHP/MySQL technology. The OpenMindat hosting platform will be connected through a synchronized MySQL database to the live Mindat server so the data on OpenMindat are always up to date. MySQL synchronization has a low impact on server performance and network bandwidth (Pohanka & Pechanec, 2020).
The OpenMindat data API has been established and made open. It deploys a RESTful API (Richardson & Ruby, 2008) structure and allows programmatic access to the data on demand. Any current registered user of Mindat with the 'data contributor' status can find an API token on their user profile. A newly registered user of Mindat can also request the 'data contributor' status and obtain the API token. A tutorial about user registration and API token request was shared by Zhang (2023), which also includes links to shared Jupyter workflows of data query as well as documentation of the API parameters (also see the section 'Data availability statement' below). With the API, various data queries are now available. The following is a list of representative use cases based on the requests and feedback from Mindat users: (1) Retrieve a full list of all IMA-approved mineral species with detailed properties. (2) Retrieve a list of mineral species matching certain chemical criteria, such as 'mineral species containing nickel or cobalt, with sulphur but without oxygen'. (3) Validate alternative mineral/rock names. For example, if the name 'amethyst' is sent, then it would return that the correct mineral species is quartz, and that 'amethyst' is a varietal name. (4) Provide a hierarchical taxonomy of petrological names and their definitions. This function will enable other geoscience data repositories, portals and platforms to readily utilize the taxonomy. (5) Provide a list of recorded mineral localities within a given area, such as an address, a polygon or a buffer zone from a point. Except for item (3), for which the data are still under cleansing, all the other use cases in the above list can now be realized through the OpenMindat data API.
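As a rough illustration of use case (2), the following Python sketch builds (but does not send) a token-authorized GET request for a chemistry-filtered species list. The base URL, the endpoint path, the parameter names (`elements_any`, `elements_all`, `elements_none`, `fields`) and the `Token` authorization scheme are all placeholders assumed for this example; the actual names should be taken from the API documentation referenced by Zhang (2023).

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical placeholder, NOT the documented OpenMindat address.
BASE_URL = "https://api.example.org"


def build_species_request(token, any_of=(), all_of=(), none_of=(), fields=None):
    """Build (without sending) a GET request for a chemistry-filtered species list.

    All query parameter names below are assumptions made for illustration.
    """
    params = {}
    if any_of:
        params["elements_any"] = ",".join(any_of)    # at least one of these elements
    if all_of:
        params["elements_all"] = ",".join(all_of)    # every one of these elements
    if none_of:
        params["elements_none"] = ",".join(none_of)  # none of these elements
    if fields:
        params["fields"] = ",".join(fields)          # restrict the returned variables
    url = f"{BASE_URL}/species/?{urlencode(params)}"
    # Token-based header; the exact scheme is an assumption.
    return Request(url, headers={"Authorization": f"Token {token}"})


# 'Mineral species containing nickel or cobalt, with sulphur but without oxygen'
req = build_species_request(
    "YOUR_API_TOKEN",
    any_of=["Ni", "Co"],
    all_of=["S"],
    none_of=["O"],
    fields=["name", "formula"],
)
print(req.get_method(), req.full_url)
```

Keeping request construction separate from sending makes the query easy to inspect and test before any load is placed on the server.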

| Ongoing work to further improve the FAIRness of OpenMindat data
Together with the OpenMindat server, we are working on two other approaches that help users access the open data. The first is developing packages in Python and R to enable data query and access directly from workflow platforms such as Jupyter and RMarkdown. The functions in those packages will wrap the detailed technical process of the RESTful API data access, and enable capabilities of interest to geoscientists, such as data query, pre-processing and integration. We will apply a use case-driven iterative approach (Ma et al., 2014) to analyse the needs of geoscientists and develop the functions accordingly. The second is building copies of datasets on certain topics and enabling bulk data downloads. These pre-packaged downloads would be rebuilt approximately once every 2 weeks depending on server usage. These will be available in JSON, CSV and MySQL formats. Data downloads will not include the big chunks of data imported from other databases, as these should instead be accessed directly from their original sources.
OpenMindat has joined the open data activities in the geoscience community to further increase the FAIRness of the shared data. GeoCODES (McHenry et al., 2021) has been successfully aggregating standardized metadata from many geoscience data repositories and is building a centralized index portal to allow researchers to quickly find data of interest. Another advantage of GeoCODES is that, since it adopts the technical approaches of Schema.org, the metadata released on each individual geoscience data repository as well as the assembled metadata at GeoCODES will all be indexable by Google Dataset Search (Noy et al., 2019). This feature will further increase the discoverability of the data, as Google has a huge user community. Mindat has already deployed a framework to host a unique webpage for each mineral species, together with a schema for the metadata. We will further refer to the IMA mineral name list and the mineral semantics discussed in Prabhu et al. (2021) to refine the elements in the metadata schema. The metadata recommendations of GeoCODES will be adapted to make our metadata schema compatible with Schema.org. Then, we will develop a program to automatically generate a piece of JSON-LD code for the metadata of each mineral species and make them indexable by GeoCODES and Google Dataset Search. This feature will help users quickly find webpages of interest on the Mindat website. For example, when they use the Google Dataset Search engine to find a certain mineral, the results will show links to Mindat. Such functionalities will be complementary to the OpenMindat data API. That is, although users can query and download datasets from the API, the mineral species webpages are still able to provide complementary information to human readers, and the inserted metadata can increase their visibility to search engines and improve the efficiency of search.
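To make the pre-packaged download formats concrete, here is a minimal Python sketch that serializes the same list of records to JSON and to CSV using only the standard library. The record layout (`name`, `formula`) is an assumption for illustration and is not the actual OpenMindat export schema.

```python
import csv
import io
import json


def to_json(records):
    """Serialize records as a JSON array, keeping non-ASCII characters readable."""
    return json.dumps(records, indent=2, ensure_ascii=False)


def to_csv(records, columns):
    """Serialize records as CSV with a header row; missing values become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for record in records:
        writer.writerow({col: record.get(col, "") for col in columns})
    return buffer.getvalue()


# Illustrative records only; the field names are hypothetical.
records = [
    {"name": "Quartz", "formula": "SiO2"},
    {"name": "Gold", "formula": "Au"},
]

print(to_json(records))
print(to_csv(records, ["name", "formula"]))
```

Generating both formats from one in-memory representation keeps the downloads consistent with each other whenever they are rebuilt.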
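The planned JSON-LD generation could look roughly like the Python sketch below, which builds a Schema.org `Dataset` description for one mineral species page and wraps it in the script tag that search engines read. The choice of properties, the wording of the description and the page URL are assumptions for illustration; the real metadata schema is still being refined against the GeoCODES recommendations.

```python
import json


def species_jsonld(name, formula, page_url):
    """Return a Schema.org Dataset description for one mineral species page.

    The selection of properties here is an assumption, not the project's schema.
    """
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": f"{name} mineral species record",
        "description": f"Mineral species {name} with chemical formula {formula}.",
        "url": page_url,
        "keywords": [name, "mineral species"],
    }


def as_script_tag(doc):
    """Wrap a JSON-LD document in the HTML script element used for embedding."""
    return (
        '<script type="application/ld+json">\n'
        + json.dumps(doc, indent=2)
        + "\n</script>"
    )


# The URL is a hypothetical placeholder, not a real Mindat page address.
doc = species_jsonld("Quartz", "SiO2", "https://www.example.org/species/quartz")
tag = as_script_tag(doc)
print(tag)
```

Embedding the tag in each species webpage is what would let harvesters such as GeoCODES and Google Dataset Search pick up the metadata without calling the data API.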

| DISCUSSION
Mindat has demonstrated thriving sustainability, mainly due to the mechanism of crowdsourcing in both data collection and data quality control and an increasing demand for accurate, trusted information. This experience can be shared with many other open data portals in various geoscience disciplines. Mindat has a very small team working on the technical development of the database software and the website in comparison to the large numbers of data contributors and data curators (see Section 2). While the data contributors across the world scale up the data input significantly, the mineralogical experts in the data curation and management team ensure the correctness, completeness and consistency of the uploaded records before they are made public. The unique data subjects in Mindat, such as mineral occurrence (i.e. species-locality pair), make the database a useful resource for exploring cutting-edge research topics, such as mineral ecology and mineral evolution (Hazen et al., 2019). The comprehensive mineral species name list, detailed attributes, integrated links to other data resources and friendly user interface make it a popular reference among both geoscience professionals and the general public for mineralogical information and knowledge.
Nevertheless, compared with the guidelines in the FAIR data principles, Mindat still faces many challenges, which are the driving force for us to propose and build OpenMindat. We need to note that, although OpenMindat has a focus on mineralogy, it will not be a data island limited to a single discipline. Instead, OpenMindat will refer to a list of existing data portals, resources and software (Table 2) to facilitate data interoperability, enrich data subjects and records and establish new functions. For instance, locality and age are the two data subjects that are urgently needed by many geoscientists. For the heterogeneous locality records, we will leverage local experts to merge duplicate textual records and apply OpenStreetMap's Nominatim service and similar tools to get latitude and longitude records. Among the current Mindat locality records (Table 1), about 41,000 have a margin of error on estimated coordinates between 3 and 500 km. Improving locality information is thus a priority; we hope to get at least 50% of these to under 3 km margin of error in the next 3 years. The age information of mineral compositions is invaluable to the study of mineral evolution. There are currently over 5,500 age records in Mindat. We will import all the 21,000 age records from the RRUFF Mineral Evolution Database (Table 2) and continue to add age records by exploring other resources.
We have scheduled a two-phase development plan for the OpenMindat data API. What we have established so far is the first phase, in which the API can search any of the datasets in OpenMindat and is able to combine them where possible. Also, the API is able to specify which variables are needed in the output. For example, a complex query could be described as 'Return all minerals and their references where the mineral contains cobalt but not oxygen and is found in South Africa or Zambia'. Results are returned in the format specified by the user. The second phase of the API, which we plan to build, will allow OpenMindat to accept new data from users into the system. Such data will need to come from registered users and will need to go through a process of review and verification before they become part of the public data. Data uploads will be limited to a number of subjects in the existing data structure of OpenMindat, including locality, mineral occurrence, specimen analysis and photograph, and those new data will be synchronized with Mindat.
Mindat has proven to be a unique and invaluable resource for scientific research, and we expect OpenMindat will further facilitate the growth of new hypotheses and discoveries. For instance, in the study of mineral ecology, the mineral/locality data from Mindat have been used to document the diversity/distribution of minerals on Earth today and predict the as yet undocumented diversity of minerals (Hystad et al., 2015, 2019). In the study of mineral evolution, based on analysis of more than 100,000 dated mineral-locality pairs, it has been discovered that Earth's mineralization has been episodic, with close associations of diverse mineral formation to intervals of supercontinent assembly, and intervening periods of reduced mineral formation (Hazen et al., 2014). Recently, Hazen and Morrison have introduced a new 'evolutionary system of mineralogy' that attempts to place all of Earth's minerals in their temporal and paragenetic contexts (Hazen & Morrison, 2020). These examples are just a small part of the innovative discoveries enabled by Mindat in various disciplines. Such data-intensive scientific pursuits will be significantly leveraged with the open data provided by OpenMindat. In our previous studies, we have also summarized experience and best practices of applying data science in geoscience, including the abductive process (Hazen, 2014), visual exploratory analysis (Ma et al., 2017) and Agile-style datathon activities (Ma, 2022). Other scientists using OpenMindat data, if needed, may adapt that experience and those best practices in their own studies.
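Progress towards the locality goal stated above (at least 50% of the affected records brought under a 3 km margin of error) can be tracked with a very small calculation. The Python sketch below assumes each locality record carries a margin of error in kilometres; the sample values are made up for illustration and are not Mindat data.

```python
def share_within(margins_km, threshold_km=3.0):
    """Fraction of records whose coordinate margin of error is under the threshold."""
    if not margins_km:
        return 0.0
    within = sum(1 for margin in margins_km if margin < threshold_km)
    return within / len(margins_km)


# Made-up margins of error (km) for the same four records before and after cleansing.
before = [5.0, 40.0, 120.0, 499.0]
after = [0.5, 2.9, 120.0, 499.0]

TARGET = 0.5  # the 50% goal stated in the text
print(share_within(before), share_within(after), share_within(after) >= TARGET)
```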
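The semantics of the complex query described above (element inclusion and exclusion, a country filter, and selection of output variables) can be sketched locally in Python as follows. This is only an in-memory illustration of the filtering logic, not the server-side implementation; the record layout and the sample records are hypothetical and do not describe real minerals.

```python
def query(records, include=(), exclude=(), countries=(), fields=None):
    """Filter records by chemistry and country, then keep only the requested fields."""
    results = []
    for rec in records:
        elements = set(rec["elements"])
        if not set(include) <= elements:
            continue  # a required element is missing
        if elements & set(exclude):
            continue  # a forbidden element is present
        if countries and not set(rec["countries"]) & set(countries):
            continue  # not reported from any of the requested countries
        results.append({key: rec[key] for key in fields} if fields else dict(rec))
    return results


# Hypothetical records for illustration only.
records = [
    {"name": "Mineral A", "elements": ["Co", "S"], "countries": ["Zambia"], "references": ["ref-1"]},
    {"name": "Mineral B", "elements": ["Co", "O"], "countries": ["South Africa"], "references": ["ref-2"]},
    {"name": "Mineral C", "elements": ["Co", "As"], "countries": ["Canada"], "references": ["ref-3"]},
]

# 'Minerals and their references where the mineral contains cobalt but not
# oxygen and is found in South Africa or Zambia'
hits = query(
    records,
    include=["Co"],
    exclude=["O"],
    countries=["South Africa", "Zambia"],
    fields=["name", "references"],
)
print(hits)
```

Only Mineral A satisfies all three conditions here: Mineral B contains oxygen and Mineral C is not reported from either country.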
The open data community has been promoting the concept of 'ecosystem', which means a virtuous circle of interacting factors such as data, tools, researchers and scientific discoveries. To establish OpenMindat as an active node in the open data ecosystem, we have allocated tasks in the above-mentioned technical developments. The data quality improvement, stable data API and direct data search and access from workflow platforms are the key items. Those outputs, once fully established, will support fluent data flow and reduce the burden of data access and cleansing for scientists. Additionally, we have a few other thoughts related to the work on OpenMindat. One of them is the semantics of mineral data. For instance, mineral species, locality, age and rock are among the key objects in OpenMindat, and all of them face the challenges of machine-readable semantics, such as shortage of community-level data models, varied data representations and formats and heterogeneity in terminology. Geoinformatics researchers have recognized those challenges and built some initiatives (Prabhu et al., 2021). Another thought is to facilitate the usage of OpenMindat and accelerate new scientific discoveries in more areas. We plan to collaborate with communities such as the American Geophysical Union, the Geochemical Society and the Geological Society of America to organize workshops and conference sessions to provide training opportunities for geoscientists and students.

| CONCLUSION
Overall, the aim of OpenMindat is to establish an open access data portal for the popular Mindat database, to boost new scientific discoveries in the evolutionary system of mineralogy. A key part of the activities is to establish a FAIR framework to curate scientifically important properties in mineralogy and develop machine-friendly interfaces for data access. We have enriched and cleansed massive records related to all the mineral species recorded in Mindat and have established an open data API to enable data query and download. Moreover, we will establish OpenMindat's connection to EarthCube GeoCODES, and develop packages in Python and R for querying and accessing OpenMindat data from workflow platforms. Based on our previous successful discoveries in data-intensive geoscience studies, we firmly believe that opening the Mindat data for free academic use will encourage a new generation of mineralogical research, including projects that we would never have imagined ourselves. We are also optimistic that by making Mindat easier to access, particularly for machines, there will be less copying and republishing of Mindat data on multiple disconnected websites that do not automatically update as changes are made to Mindat. We hope that OpenMindat will encourage a greater involvement from the wider geoscience community, with more people willing to share their data for the benefit of all, rather than creating their own local resource online. We are convinced that multi-dimensional analysis of these mineralogical data from a recognized, authoritative OpenMindat server will lead to important new revelations.

FIGURE 1 Number of scientific publications citing 'mindat.org' during the period 2006-2021 (records from Google Scholar).

TABLE 1 column headings: Subject | Number of records | Properties & description | Community standards, data sources and curation concerns
FIGURE 3 OpenMindat's technical structure and its approach to leverage and reuse existing data resources, tools and infrastructure.
TABLE 2 Identified data resources for complementing OpenMindat records and functions.