An image dataset of fusulinid foraminifera generated with the aid of deep learning

Fusulinid foraminifera are among the most common microfossils of the Late Palaeozoic and act as key fossils for stratigraphic correlation, paleogeographic and paleoenvironmental indication, and evolutionary studies of marine life. Accurate and efficient identification forms the basis of such research involving fusulinids but is limited by the lack of digitized image datasets. This article presents the first large image dataset of fusulinids, containing 2,400 images of individual specimens belonging to 16 genera across all six fusulinid families and labelled to species level. These images were collected from the literature and from our unpublished samples through an automatic segmentation procedure implementing BlendMask, a deep learning model. The dataset shows promise for the efficient accumulation of fossil images through automated procedures and will assist taxonomists in future morphologic and systematic studies.


| INTRODUCTION
Foraminifera are a diverse group of marine unicellular protists that usually grow shells (called 'tests') formed of various materials, with one or more apertures allowing the rhizopodia to emerge (Gupta, 1999; Saraswati & Srinivasan, 2016). They have an extremely long evolutionary history, possibly beginning in the Neoproterozoic according to molecular evidence (Pawlowski et al., 2003; Tappan & Loeblich, 1988), and extending to the modern day. Throughout their history, they have exhibited high diversity and served as an important component of the marine ecosystem (Gupta, 1999; Vachard et al., 2010). Fusulinids, herein referring to the Order Fusulinida (Wedekind, 1937) under the Class Fusulinata (Dubicka et al., 2021; Vachard et al., 2010), are Palaeozoic larger benthic foraminifera that developed calcareous, planispiral, mostly spindle-shaped tests with dozens to over 100 segmented chambers (Figure 1a). They appeared in the shallow seas of the Carboniferous and thrived until the Late Permian (Pawlowski et al., 2003; Vachard et al., 2010). They evolved quickly, with rapid morphological changes in the tests, and have therefore been accepted as an eminent index fossil for Late Palaeozoic biostratigraphy (BouDagher-Fadel, 2008; Ross & Ross, 1991). They were widely distributed in tropical, subtropical and temperate regions around the world and exhibit paleobiogeographic significance (Arefifard & Clapham, 2021; Ozawa, 1987; Ross, 1995; Shi et al., 2017). Their benthic adaptation, along with the dynamics of their biodiversity and community structure, has also been widely studied to reconstruct the shallow marine environment of the Carboniferous and Permian periods (Davydov et al., 2013; Ross, 1969; Vachard et al., 2010).
Fusulinid classification is mainly based on wall (named 'spirotheca') structure and internal endoskeletons, such as the folding state of the chamber partitions (septal folding, Figure 1a,c) and the development of calcite double ridges in the central area (chomata) or multiple ridges in the chambers (parachomata). Most of these morphological characters can be observed in hand-ground thin sections cut parallel to the fusulinid test-coiling axis and passing through the beginning chamber (proloculus) (Figure 1b), the so-called axial sections. Axial sections display the internal chamber morphology through fusulinid ontogeny and are therefore the standard sections for systematic study (Sheng et al., 1988; Vachard et al., 2010). Sagittal sections, perpendicular to the coiling axis and passing through the proloculus (Figure 1b), provide only limited information. Hundreds of scientific monographs and articles on fusulinids addressing the investigations mentioned above have been published since the nineteenth century, presenting a large number of specimen photos and/or images, mostly of the two section types described. These photos and images form the basic materials for fusulinid taxonomists, as access to real specimens is often restricted by various conditions, and the identification quality of different taxonomists varies with experience, criteria and the image sets examined, as in many other identification practices (Culverhouse et al., 2014; MacLeod et al., 2010).
With enough images involved, quantitative morphological analyses (Arefifard, 2019; Huang, 2011; Shi, 2021; Shi & MacLeod, 2016) and automatic identification systems (Mitra et al., 2019; Pires de Lima et al., 2020) can help improve the consistency and efficiency of identification practices. An example is 'Endless Forams' (Hsiang et al., 2019), an image dataset that provides over 34,000 images of 35 modern foraminifera species. Automated species-level identification methods (Hsiang et al., 2019; Karaderi et al., 2022; Marchant et al., 2020) were developed on this dataset, with performance comparable to that of human experts. Several image datasets for fossil organisms have been made available, including foraminifera (Hsiang et al., 2019; Mitra et al., 2019), graptolites (Niu & Xu, 2022), fossil leaves (Wilf et al., 2021) and a multiple-body-fossil mixture (Liu et al., 2022). However, only a few such efforts have addressed fusulinids; for example, Pires de Lima et al. (2020) utilized roughly 300 photos to automatically identify eight genera.

FIGURE 1 Fusulinid sketch and two image data sources. (a) Cutaway view of a schwagerinid fusulinid and its typical features of septation and endoskeleton, after Dunbar and Condra (1927). (b) Schematic diagram of the axial and sagittal sections of a fusulinid fossil, after Sheng et al. (1988). (c) A thin-slice photo of limestone-preserved fusulinids, with the main characters illustrated. (d) Scanned image example of a piece of literature on fusulinids.
Fusulinid images can be accumulated in two ways. For specimens preserved in limestone, as is usual, thin slices must be prepared and photographed under a microscope (Figure 1c). Alternatively, published figures can easily be digitized with a flatbed scanner (Figure 1d). Manual segmentation is traditionally needed in both procedures, to separate the specimen from the surrounding sediments in slice photos or to isolate the individuals in scanned figures. Segmentation used to be the most time-consuming part of the whole process, but it can now be overcome with newly developed deep learning frameworks. Here, to address this issue and put it into practice, we segmented images from both our photographed slices and 49 pieces of literature with the aid of deep learning techniques, generating the first large image dataset of fusulinids, which contains 2,400 images of 16 genera covering all the fusulinid families, following the classification of Sheng et al. (1988). While the dataset will support future systematic study of fusulinids, the auto-segmentation technique is also valuable for image acquisition across fossil groups.

| Data summary
Our dataset comprises 2,400 thin-slice images of fusulinid individuals, including 295 microscope photos and 2,105 scanned images from the literature. Images are stored as PNG files with the transparency channel annotating the outline of the fossil, and are labelled according to their species name and data source. The 2,400 images are selected in equal numbers from 16 genera across all six fusulinid families: Fusulinidae, Schwagerinidae, Ozawainellidae, Schubertellidae, Neoschwagerinidae and Verbeekinidae. The most distinctive morphological characters of these six families, based on Loeblich and Tappan (1964) and Sheng et al. (1988), are: (1) a fusiform test in fusulinids, with a spirotheca composed of three or four layers (tectum and upper and lower tectoria with a transparent diaphanotheca in between; the upper tectorium is sometimes absent, see Figure 2a); (2) a medium to large fusiform test in schwagerinids, with a spirotheca composed of a tectum and alveolar keriotheca (Figure 2b); (3) a small, umbilicate test in typical ozawainellids, with length shorter than width (Figure 2c); (4) a small to medium fusiform test in schubertellids, with a spirotheca composed of three or four layers (Figure 2d); (5) a mostly inflated fusiform test in neoschwagerinids, with significant parachomata and septula developed (Figure 2e); and (6) a medium to large, spherical to subspherical test in verbeekinids, with distinct parachomata developed (Figure 2f). Table 1 gives an overview of the taxonomy and the amount of data. Images of holotypes, paratypes, cotypes and syntypes of the selected species were preferentially chosen.
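For readers who wish to work with the files programmatically, the sketch below shows one way to read a dataset image and recover the fossil mask from its transparency channel with OpenCV and NumPy. It is illustrative rather than part of the dataset's tooling, and the file name used is hypothetical.

```python
# Minimal sketch (not part of the dataset tooling): read a PNG together with its
# alpha channel and treat the alpha channel as the fossil mask. File name is hypothetical.
import cv2
import numpy as np

img = cv2.imread("Fusulina_cylindrica_01_001.png", cv2.IMREAD_UNCHANGED)  # H x W x 4 (BGRA)
content = img[:, :, :3]   # image content (greyscale or colour thin-section data)
alpha = img[:, :, 3]      # transparency channel annotating the fossil outline
mask = alpha > 0          # True where the pixel belongs to the fusulinid individual

# Example use: blank out the background before any morphometric measurement
fossil_only = np.where(mask[:, :, None], content, 0).astype(np.uint8)
print("fossil pixels:", int(mask.sum()))
```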

| Data source
The 295 photographed specimens were collected from Guizhou and Yunnan, China, by the authors and photographed using cameras attached to a transmitted light microscope (Figure 3a). The other 2,105 images were obtained from 49 pieces of scientific literature and atlases. The monograph Catalogue of Foraminifera (Ellis & Messina, 1940-2015) provides the largest portion (922 images), as it contains a wide range of species type descriptions and illustrations. The remaining literature includes a large body of publications reporting fusulinids recovered from Chinese geological sections.

| Image auto-segmentation
Deep learning was implemented to segment figures and thin-slice photos into images containing one individual each (Figure 3a-c). Figures in the literature had already been segmented by their original authors, so the edges of specimens are easy to detect and algorithms based on edge detection can achieve good results. In rock slice photos, however, fusulinid specimens are not well distinguished from the surrounding sediments, and a more advanced technique is needed.
The past two decades have seen rapid development in computer vision along with the renaissance of artificial neural networks (ANNs). ANNs consist of a collection of 'neurons' and the 'edges' that connect them. Each edge passes information between neurons according to a 'weight' that specifies the relative importance of that message. The breakthrough in applying ANNs to image recognition came with convolutional neural networks (CNNs). Since the modern CNN framework was established (LeCun et al., 1989, 1998), there have been rapid advances in the development and application of CNNs. A basic CNN consists of three types of layers: convolutional, pooling and fully connected. Convolutional layers learn feature representations of the inputs, pooling layers achieve shift-invariance by reducing the resolution of the feature maps, and fully connected layers perform high-level reasoning (Gu et al., 2018).
AlexNet (Krizhevsky et al., 2012) was among the first CNNs to win the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) (Russakovsky et al., 2015), and it established deep learning as a ubiquitous technique for improving model performance. Models such as VGGNet (Simonyan & Zisserman, 2014), ResNet (He et al., 2016) and DenseNet (Huang et al., 2017) successively represented the state of the art at the time and remain benchmarks for major models today. CNNs have also been applied to computer-vision tasks beyond image classification, including object detection, image segmentation and image generation. Image segmentation technologies are mainly divided into semantic segmentation and instance segmentation. Semantic segmentation associates each pixel of an image with a class label, while instance segmentation is comparatively more challenging as it masks each instance of an object in an image independently (Sultana et al., 2020). U-Net (Weng & Zhu, 2015) and Mask R-CNN (He et al., 2017) are two representative models for image segmentation and have been applied in many fields, including palaeontology (Carvalho et al., 2020; Johansen et al., 2021; Yu et al., 2022). In our study, we used BlendMask for instance segmentation, given the need to extract each fusulinid individual from a thin-slice image.
BlendMask (Chen et al., 2020), developed in 2020, is a one-stage detection framework that uses a blender module to combine top-down and bottom-up information. Deep learning models for object detection and image segmentation follow one of two paradigms. A one-stage framework implements only a convolutional neural network, without the additional 'head' network used in two-stage frameworks for dense prediction of the most likely regions of interest (RoIs) (Ren et al., 2017). One-stage frameworks are generally worse than two-stage frameworks such as Mask R-CNN in terms of mask precision, but they offer better multi-platform, multi-task applicability and higher speed. Recent developments such as the object detection model FCOS (Tian et al., 2019) show that one-stage frameworks can outperform two-stage ones. The difficulty in using one-stage models for instance segmentation lies in the 'direction' in which information is drawn from images, either top-down (extrapolating instance masks within a window) or bottom-up (grouping individual pixels to form masks); each approach loses the information carried by the other direction. The blender module in BlendMask combines high-level instance information with low-level semantic information, generating high-resolution masks with good precision while spending less time on inference than earlier models (Chen et al., 2020). BlendMask is published along with PyTorch-based code that is easy to port and use (see GitHub, https://github.com/aim-uofa/AdelaiDet).
We selected ResNet-50, a medium-sized CNN that achieves good results with reasonable computational resources, as the backbone of our BlendMask model. The model was trained on 400 manually labelled images consisting of both thin-slice photos and literature figures, in which the outline of each fusulinid individual was marked with a uniform label. We used the LabelMe software, which provides a graphical interface, for the labelling process. The labelled images were then fed into the model for 10,000 training iterations, with hyperparameters kept at their defaults except for the batch size, which was set to 6. The loss dropped dramatically in the first 2,000 iterations and then decreased slowly. After the total loss stabilized at around 0.64, training was stopped at the 10,000th iteration to prevent overfitting (Figure 4a). The trained model can accurately identify and segment fusulinid images from both literature figures and raw thin-section micrographs (Figure 4b). Segmentation errors, including omissions, misidentifications and incomplete segmentations, were rarely seen and mostly occurred in poorly preserved fossils; those images were screened out of the final dataset. With the help of the deep learning model BlendMask, around 10,000 images of fusulinid individuals were segmented within a few hours. The model was trained and applied on a PC with an RTX 3080 Ti graphics card.
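For reference, launching such a training run might look like the sketch below, which assumes the AdelaiDet code base that BlendMask is released with and Detectron2-style "KEY VALUE" config overrides. The dataset name, paths and output directory are placeholders, and only the batch size (6) and iteration count (10,000) come from the settings described above.

```python
# Sketch of launching BlendMask training via the AdelaiDet repository
# (https://github.com/aim-uofa/AdelaiDet). Paths, the dataset name and the output
# directory are hypothetical; it assumes the LabelMe annotations have been converted
# to a COCO-style dataset registered under the name "fusulinid_train".
import subprocess

subprocess.run(
    [
        "python", "tools/train_net.py",
        "--config-file", "configs/BlendMask/R_50_1x.yaml",  # ResNet-50 backbone
        "--num-gpus", "1",
        # Detectron2-style config overrides appended as KEY VALUE pairs:
        "DATASETS.TRAIN", "('fusulinid_train',)",
        "SOLVER.IMS_PER_BATCH", "6",        # batch size used in this study
        "SOLVER.MAX_ITER", "10000",         # training stopped at 10,000 iterations
        "OUTPUT_DIR", "training_dir/blendmask_fusulinid",
    ],
    check=True,
)
```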

| Image scale
In the literature, the scales of fusulinid images are indicated either by a scale bar attached to the image or by a magnification statement in the explanatory text. Although scale is generally important for fusulinid identification, the current automatic segmentation model cannot directly extract scale information from the images or texts. Instead, image sizes are kept as in the original sources, and the magnification statements are listed in the CSV file attached to the dataset.

| Image selection and labelling
All images in the final dataset illustrate axial sections of fusulinid individuals; sagittal sections are excluded. Images of poor quality, in which most features of the specimen (such as the test outline, septa and chomata) are incomplete or not clearly shown because of preservation or printing, are also excluded. Specimens identified under open nomenclature are not included.
The label of each image is reflected in its file name, which contains the genus name, subgenus name (if any), species name, subspecific or infrasubspecific names (if any), source literature number and a sequential number (relative to all images of the same species derived from the same source). The species names in the labels are kept the same as in the original descriptions by the authors. Tags used to mark subgenera, subspecies, variants, etc. (e.g. var., ssp.) are recorded, along with other information, in a separate CSV file attached to the dataset. Inconsistencies almost certainly exist among the identifications of different authors, especially for genera with many species such as Schwagerina and Pseudofusulina. However, a hasty revision of these identifications would be subjective, so the original authors' opinions are respected and preserved here.
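As an illustration only, a file name following this scheme could be parsed along the lines of the sketch below; the underscore separator and field order are assumptions for demonstration, and the attached CSV file remains the authoritative record of the metadata.

```python
# Hypothetical parsing of a dataset file name into its labelled fields.
# Separator and field order are assumed for illustration; consult the attached
# CSV file for the authoritative metadata.
from dataclasses import dataclass

@dataclass
class FusulinidLabel:
    genus: str
    species: str
    source_no: str   # source literature number
    sequence: str    # sequential number within the same species and source

def parse_label(filename: str) -> FusulinidLabel:
    stem = filename.rsplit(".", 1)[0]
    genus, species, source_no, sequence = stem.split("_")[:4]  # assumed layout
    return FusulinidLabel(genus, species, source_no, sequence)

print(parse_label("Fusulina_cylindrica_01_003.png"))
# FusulinidLabel(genus='Fusulina', species='cylindrica', source_no='01', sequence='003')
```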

| Image enhancement
An enhancement process was employed to reduce discrepancies in image quality unrelated to fossil structure. It included rotation correction and histogram equalization (Figure 3d). The process was carried out after converting the images to greyscale, a conversion intended to eliminate noise introduced by colour in the slice photos.
An automatic rotation method based on line detection was used to reorient the segmented images to a preferred angle. The Hough line transform (Hough, 1962), a commonly used line detection method with relatively low computational complexity, was implemented to obtain the distribution of lines in an image and their average offset angle. Correction methods based on the fossil contour were not adopted, as fossils may have irregular contours owing to their preservation.
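A minimal sketch of this rotation-correction idea with OpenCV is given below; the edge-detection and Hough thresholds are illustrative values, not the parameters used in our pipeline.

```python
# Sketch of rotation correction with the Hough line transform in OpenCV.
# Threshold values are illustrative and not the parameters used in this study.
import cv2
import numpy as np

def deskew(gray: np.ndarray) -> np.ndarray:
    edges = cv2.Canny(gray, 50, 150)                    # binary edge map
    lines = cv2.HoughLines(edges, 1, np.pi / 180, 100)  # array of (rho, theta) pairs
    if lines is None:
        return gray                                     # no lines detected, keep as-is
    # theta is the angle of each line's normal; deviation from horizontal is theta - 90 deg
    angles = [np.degrees(theta) - 90.0 for _, theta in lines[:, 0]]
    offset = float(np.mean(angles))                     # average offset angle
    h, w = gray.shape
    m = cv2.getRotationMatrix2D((w / 2, h / 2), offset, 1.0)
    return cv2.warpAffine(gray, m, (w, h), flags=cv2.INTER_LINEAR)
```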
Contrast Limited Adaptive Histogram Equalization (CLAHE) (Pizer et al., 1990), a variant of Adaptive Histogram Equalization (AHE), is frequently used in medical imaging such as CT scans to enhance image contrast (Pizer et al., 1987). We implemented CLAHE to balance the contrast of fusulinid images from different sources and across different parts of an image. In general AHE, an image is partitioned into a rectangular grid of contextual regions, and the histogram and optimal contrast are calculated for each subregion to obtain a grey-level assignment table for local contrast optimization (Pizer et al., 1987). A major drawback of this approach is visible background noise, especially when it is applied to scanned images. CLAHE restricts contrast enhancement by clipping the histogram and redistributing the excess evenly when generating the grey-level assignment for homogeneous regions consisting of pixels concentrated in a narrow greyscale range (usually background areas), which greatly reduces background noise compared to AHE (Reza, 2004).
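A minimal CLAHE sketch with OpenCV follows; the clip limit and tile grid size shown are commonly used illustrative values rather than the settings adopted here, and the file name is hypothetical.

```python
# Sketch of contrast equalization with OpenCV's CLAHE implementation.
# clipLimit and tileGridSize are illustrative, not the values used in this study.
import cv2

gray = cv2.imread("segmented_individual.png", cv2.IMREAD_GRAYSCALE)  # hypothetical file
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
equalized = clahe.apply(gray)
cv2.imwrite("segmented_individual_clahe.png", equalized)
```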
To avoid artificial influences, no further adjustments were made. Almost all images contain noise introduced at some point in the image generation procedure (e.g. acquisition, conversion and storage) (Boyat & Joshi, 2015). Because most prevailing denoising methods corrupt the information in an image to some extent, noise reduction was not performed. The image enhancement methods used here were implemented with OpenCV, an open-source platform for image processing.

| POTENTIAL USAGE
As the first large image dataset for fusulinids, the release of this 16-genus, species-level dataset will facilitate future systematic and morphometric studies. The dataset provides excellent comparative and training material, covering all six fusulinid families and containing type specimens of most of the selected species. Although the images scanned from publications have poorer resolution, the image enhancement techniques described above alleviate this problem to a large extent and are strongly recommended before further image analyses; the images themselves are retained at their original resolution in the dataset to preserve the authentic information. The dataset also provides sufficient data volume for large-scale morphologic analyses, which previously relied on tens to hundreds of samples (Shi, 2021; Shi & MacLeod, 2016). Its easy accessibility and large data volume allow it to be used for research as well as for teaching and training. More importantly, the machine learning technique used to build this dataset lays the foundation for subsequent innovative studies: large numbers of images of individual fossils can be accumulated through an automatic procedure, not only from body-fossil photos but also from thin-slice micrographs. This shows the promise of machine recognition in palaeontology, now that human face recognition (Du et al., 2022) and extant organism identification (Wäldchen & Mäder, 2017, 2018) based on machine learning techniques are already operational. Although relatively small compared to datasets for other recognition tasks, our dataset is sufficient in data volume to support the training of classical deep learning networks to recognize different fusulinid taxa.

FIGURE 3 The process of utilizing deep learning to build the fusulinid dataset. (a) Raw image acquisition: raw images were acquired by taking micrographs of rock samples and scanning the figures in publications. (b) Training of the deep learning segmentation model: three hundred and ten raw images were selected for manual annotation and then used to train BlendMask, a neural network for instance segmentation. (c) Automatic image segmentation: the trained model was applied to segment the other raw images into images containing only one fossil each; images were then carefully selected and labelled. (d) Image processing: labelled images were re-orientated using the Hough line transform and stored in the data repository after their contrast was equalized with CLAHE.

FIGURE 4 Performance of BlendMask in fusulinid image segmentation. (a) Loss curve of the training process; the total loss is the sum of the semantic loss, FCOS loss and mask loss (Chen et al., 2020; Tian et al., 2019). (b) Examples of segmentation results from literature figures (left) and thin-slice photos (right). Fossil individuals are marked with coloured masks; the numbers above the boxes indicate the recognition probability (only detections >50% are masked and shown).
The dataset is stored as a ZIP file in the Deep-time Digital Earth (DDE) international big science program Data Publisher & Repository (http://repository.deep-time.org/detail/1591288605468725250). The file contains the two main folders of the dataset (the original images and the CLAHE-processed images), each holding 16 folders organized by genus name, with the 150 images belonging to that genus in each folder. The images are in PNG format, with the transparency channel specifying the mask of the fusulinid individual. A CSV file states information such as image file name, taxonomic information, specimen information, magnification and data source. References for the data sources are listed in an XLSX file, and a TXT file clarifies additional information such as the file-naming rules.
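After extracting the archive, the folder layout and metadata can be explored with a short script such as the sketch below; the local root path, sub-folder names and CSV file name are placeholders, since the exact naming is documented in the TXT file shipped with the dataset.

```python
# Sketch of walking the extracted dataset; the root path, sub-folder names and the
# CSV file name are placeholders (see the TXT file in the archive for the actual rules).
from pathlib import Path
import csv

root = Path("fusulinid_dataset")                          # extracted ZIP (hypothetical path)
for genus_dir in sorted((root / "original").iterdir()):   # one folder per genus
    n_images = len(list(genus_dir.glob("*.png")))
    print(f"{genus_dir.name}: {n_images} images")         # expected: 150 per genus

with open(root / "metadata.csv", newline="", encoding="utf-8") as fh:  # hypothetical name
    for row in csv.DictReader(fh):
        # each row holds file name, taxonomy, specimen info, magnification and source
        pass
```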

TABLE 1 Overview of the taxonomy and the amount of data from the two sources (columns: Subfamily, Genus, No. of images from thin-slice photos, No. of images from the literature, Total No.).
Note: The fusulinid classification system follows Sheng et al. (1988).