Automated extraction of seed morphological traits from images

The description of biological objects, such as seeds, mainly relies on manual measurements of a few characteristics, and on visual classification of structures, both of which can be subjective, error-prone and time-consuming. Image analysis tools offer means to address these shortcomings, but we currently lack a method capable of automatically handling seeds from different taxa with varying morphological attributes and obtaining interpretable results. Here, we provide a simple image acquisition and processing protocol and introduce Traitor, an open-source software available as a command-line interface (CLI), which automates the extraction of seed morphological traits from images. The workflow for trait extraction consists of scanning seeds against a high-contrast background, correcting image colours, and analysing images with the software. Traitor is capable of processing hundreds of images of varied taxa simultaneously with just three commands, and without a need for training, manual fine-tuning or thresholding. The software automatically detects each object in the image and extracts size measurements, traditional morphometric descriptors widely used by scientists and practitioners, standardised shape coordinates, and colorimetric measurements. The method was tested on a dataset comprising 91,667 images of seeds from 1228 taxa. Traitor's extracted average length and width values closely matched the average manual measurements obtained from the same collection (concordance correlation coefficient of 0.98). Further, we used a large image dataset to demonstrate how Traitor's output can be used to obtain representative seed colours for taxa, determine the phylogenetic signal of seed colour, and build objective classification categories for shape with high levels of visual interpretability. Our approach increases productivity and allows for large-scale analyses that would otherwise be unfeasible. Traitor enables the acquisition of data that are readily comparable across different taxa, opening new avenues to explore the functional relevance of morphological traits and to advance new tools for seed identification.


| INTRODUCTION
The extraction of morphological traits from biological objects, such as seeds, is a highly specialised task that has historically relied on manual quantification and classification of morphological attributes.
Manual quantitative measurements of seed size and surface structure (Kleyer et al., 2008) are time-consuming, limiting the number of species and replicates that can be measured. In contrast, assigning labels to seed traits based on visual perception can be relatively quick, but it lacks objectivity: subtle differences in colour and shape among seeds require refined categories to separate taxa (Choi et al., 2012; Shimai, 2022), and this leads to frequent inconsistencies in classification. These limitations have hindered the advancement of seed identification tools and the compilation of standardised seed morphological traits at a global scale (Saatkamp et al., 2019).
Image analysis methods offer an alternative approach for rapid and objective characterisation of biological objects. The first step for image processing is the segmentation of objects from the background, which is critical for the accuracy of all measurements.
Well-established methods, such as thresholding (Loddo et al., 2022; Olivoto, 2022) and deep learning (Ott & Lautenschlager, 2022; Schwartz & Alfaro, 2021), have been adapted for biologists with little or no coding experience, but their practicality is restricted when working with numerous species and structures. Thresholding can be inefficient when the seeds being measured vary in size, surface structure and colour, because the optimal parameters change from sample to sample, and it can fail altogether when seeds are very small, glossy or have protruding structures. Deep learning-based segmentation requires a large amount of annotated data to train the network, which is limited by data availability and incurs computational and time costs.
After segmenting the image, the next step involves extracting traits from each detected seed. While some applications for plant scientists are available as flexible tools to perform various morphological analyses, these require prior knowledge of Python (Gehan et al., 2017; Lürig, 2022). In contrast, most automatic seed measurement software provides users with only a few morphometric measurements (e.g. Tanabata et al., 2012; Zhu et al., 2021). Although there are options to automatically extract a vast array of features, these are mainly intended for seed identification classifiers (Loddo et al., 2022), rather than providing data that can be interpreted and visualised. Thus, currently available tools do not adequately support users who work with a diverse range of taxa to enhance the characterisation of seed morphology and build standardised category systems for colour and shape.
Here, we propose an image analysis approach to enhance the time efficiency and objectivity of morphological trait extraction, with the capability to automatically handle seeds of different colours, shapes, and structures. We describe a simple image acquisition and processing protocol and introduce Traitor, an open-source software available as a command-line interface (CLI), which automates the extraction of seed size, shape, and colour from images. Its name alludes to two ideas: a software that carries out trait extractions, and one that 'betrays' the seeds (and other biological objects) by giving away their valuable information to users. Traitor uses unsupervised segmentation to automate the separation of seeds from high-contrast backgrounds, eliminating the need for training, manual fine-tuning, or thresholding according to the particularities of each taxon/structure. The software is capable of processing hundreds of images of varied taxa simultaneously with just a few steps, allowing for large-scale analysis that would otherwise be unfeasible. It also provides means to easily verify the quality of image segmentation. We give detailed instructions for seed trait extraction, evaluate the accuracy of the protocol using images of 1228 taxa and provide case studies demonstrating how data extracted by Traitor can be used in ecological and evolutionary studies.

| MATERIALS AND METHODS
The workflow for seed trait extraction consists of sample preparation, image acquisition, image processing and trait measurement with Traitor software (Figure 1).

| Sample preparation
Seeds should be separated from debris, especially same-sized debris that cannot be numerically filtered. Appendages can be included with the following exceptions: (1) hairy appendages (e.g. pappus) as the static electricity on the scanner glass often displaces light, hairy structures, hindering image acquisition; (2) extremely fine structures (e.g. long hairs) because the algorithm for image segmentation may not handle fine structures blending into the background; (3) bent elongated appendages (e.g. bent awns), which may not be properly aligned resulting in incorrect size measurements. We recommend prior tests and adjustments for these three cases.

| Image acquisition
Images of seeds should be acquired observing the following requirements: (1) the background should be homogeneous and distinct in colour from the seeds; (2) lighting must be uniform and conditions should be the same across all images; (3) sharp shadows should be avoided, for example, by allowing some space between seeds and background; (4) the scale must be known. The present protocol uses flatbed scanners covered with a frame approximately 10 mm thick with a royal blue background to obtain the images, because this set-up conveniently meets these requirements (Figure 2; an HP Scanjet G4010 was used to build the dataset in Section 3). Seeds must be arranged without any contact or overlap, avoiding the edges of the scanner because of background shadows and the potential for lateral chromatic aberration (Matsuoka et al., 2012). Image resolution should be chosen based on the size of the seeds: a higher resolution is needed to represent small seeds and irregular surfaces well. All automatic correction functions provided by the scanner software should be disabled to ensure that the RGB values of the samples are not manipulated. It is recommended that the resulting images are saved in TIFF or PNG format, although the JPEG format is also acceptable provided the images are acquired with a high spatial resolution and chrominance subsampling is minimised.

KEYWORDS: diaspores, high-throughput phenotyping, image segmentation, interpretability, morphological description, seed morphology, seed traits, trait measurement

| Image processing
Colour measurements should be standardised to optimise their reproducibility and allow comparison between measurements obtained with different equipment and set ups (Stevens et al., 2007).
Users can accomplish this by scanning a colour chart containing colour standards under the same settings as the seeds. The image of the colour chart is used to calculate a colour conversion matrix, based on a least squares fit in linear-RGB space, which should then be applied to images for optimal colour reproduction. A Spyder Checkr® 24 card (Datacolor, NJ, USA) was used to build the dataset in Section 3 (code available at https://github.com/rdayrell/colour_calibration). Scanned images should be cropped to areas containing seeds, avoiding objects or shadows on the background. Save images in PNG format with no compression, to avoid quality loss.

FIGURE 1 Workflow overview for trait extraction.
FIGURE 2 (a) Seeds arranged on the glass of the flatbed scanner. (b) The flatbed scanner and a wooden frame 10 mm thick with a royal blue background (c), which is used as a lid. (d) Image of seeds after image processing.
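The least squares fit behind the colour conversion matrix can be illustrated with a minimal sketch. It is not the code from the colour calibration repository: the function names are hypothetical, the patch values are synthetic, and all colours are assumed to be already linearised and scaled to the range 0 to 1.

```python
import numpy as np


def fit_colour_matrix(measured_linear, reference_linear):
    """Least-squares 3x3 matrix M such that measured_linear @ M ~ reference_linear.

    Both inputs are (n_patches, 3) arrays of linear-RGB values (assumption).
    """
    measured = np.asarray(measured_linear, dtype=float)
    reference = np.asarray(reference_linear, dtype=float)
    matrix, *_ = np.linalg.lstsq(measured, reference, rcond=None)
    return matrix


def apply_colour_matrix(image_linear, matrix):
    """Apply the fitted matrix to every pixel of a linear-RGB array (..., 3)."""
    pixels = np.asarray(image_linear, dtype=float).reshape(-1, 3)
    corrected = np.clip(pixels @ matrix, 0.0, 1.0)
    return corrected.reshape(np.shape(image_linear))


# Synthetic example: 24 reference patches distorted by a made-up scanner response.
rng = np.random.default_rng(0)
reference_patches = rng.uniform(0.05, 0.95, size=(24, 3))
scanner_response = np.array([[0.90, 0.05, 0.00],
                             [0.03, 1.10, 0.02],
                             [0.00, 0.04, 0.85]])
measured_patches = reference_patches @ scanner_response

colour_matrix = fit_colour_matrix(measured_patches, reference_patches)
corrected_patches = apply_colour_matrix(measured_patches, colour_matrix)
max_error = float(np.abs(corrected_patches - reference_patches).max())
```

In practice the measured values would come from the scanned colour chart and the reference values from the chart manufacturer, and the fitted matrix would then be applied to the linearised seed images before converting back to sRGB.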

| Software
Traitor is image analysis software for automated extraction of measurements from images containing seeds on a high-contrast background. The software was built on Python open-source libraries, such as OpenCV and scikit-learn (Bradski, 2000; Pedregosa et al., 2011). It is capable of automatically processing multiple images simultaneously and measures all traits with only three commands (Figure 1b), which are explained in detail below. All functionalities are accessible via CLI.

| Extract
Traitor uses k-means clustering (Pedregosa et al., 2011) to separate seeds from a high-contrast background without the need for training, fine-tuning or thresholding. The algorithm iteratively groups pixels into a predefined number of clusters based on their similarity, assigning each pixel to the cluster with the closest centroid, which effectively separates the foreground from the background. Once PNG files are placed in a directory, the 'extract' command segments each image and detects every object in it. A binary mask of the image with the outlines of the detected objects (e.g. seeds) is created and serves as input for the 'align' command. Traitor can also generate two optional outputs for convenience: (1) a bounding box can be created, with or without background removal, for every detected object within the image, which can then be used in machine learning approaches; (2) the contours of every detected object can be drawn on the image, allowing users to easily assess the quality of the segmentation. All output files are created in subdirectories (one for each image) within the chosen output directory.
If several images do not yield a satisfactory result in this step, users should optimise image acquisition, observing the requirements for Traitor's optimal use (Section 2.2). If only a few problematic cases occur, users should first rerun the 'extract' command on these images: as k-means relies on a random initialisation, additional runs can lead to satisfactory results. As a last resort, minor problems can be corrected by editing the mask or the images (e.g. deleting large debris from the image).
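The idea of separating seeds from a uniform background by clustering pixel colours can be sketched as follows. This is an illustration only and not Traitor's implementation: it uses a small NumPy version of two-cluster k-means instead of scikit-learn, a simplified farthest-point initialisation, and the assumption that the cluster covering most of the image border is the background. All function names and the synthetic image are hypothetical.

```python
import numpy as np


def kmeans_two_clusters(pixels, seed=0, n_iter=10):
    """Two-cluster k-means (Lloyd's algorithm) on an (n, 3) array of colours."""
    rng = np.random.default_rng(seed)
    # Simplified initialisation: one random pixel plus the pixel farthest from it.
    first = pixels[rng.integers(len(pixels))]
    second = pixels[np.linalg.norm(pixels - first, axis=1).argmax()]
    centroids = np.stack([first, second]).astype(float)
    for _ in range(n_iter):
        distances = np.linalg.norm(pixels[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        for k in range(2):
            if np.any(labels == k):
                centroids[k] = pixels[labels == k].mean(axis=0)
    return labels


def segment_foreground(image, seed=0):
    """Return a boolean mask that is True where a pixel belongs to an object."""
    height, width, _ = image.shape
    pixels = image.reshape(-1, 3).astype(float)
    labels = kmeans_two_clusters(pixels, seed=seed).reshape(height, width)
    # Assumption: the cluster that dominates the image border is the background.
    border = np.concatenate([labels[0], labels[-1], labels[:, 0], labels[:, -1]])
    background_label = np.bincount(border, minlength=2).argmax()
    return labels != background_label


# Synthetic 40 x 40 "scan": blue background, two differently coloured seeds, mild noise.
rng = np.random.default_rng(1)
image = np.tile(np.array([65.0, 105.0, 225.0]), (40, 40, 1))
image[10:20, 8:14] = [120.0, 80.0, 40.0]
image[25:32, 22:35] = [200.0, 170.0, 110.0]
image += rng.normal(0.0, 3.0, size=image.shape)

mask = segment_foreground(image)
```

The `seed` argument stands in for the random initialisation mentioned above: changing it changes the starting centroids, which is why rerunning a clustering-based segmentation can give a different result on difficult images.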

| Align
Next, the software aligns images. The first contour is aligned in an upright position according to its minimal bounding ellipse. All other contours are then rotated to best match the reference contour using Procrustes analysis, which finds the optimal translation, scaling and rotation that minimises the sum of squared differences between corresponding points on the contours. The Procrustes analysis is not applied to the raw contours, but instead to a low-complexity approximation of the shapes generated using elliptic Fourier analysis. This complexity reduction was found to increase the algorithm's ability to successfully align damaged or irregularly shaped objects.
The 'align' command creates three outputs in separate directories: (1) 'contours' contains csv files with xy coordinates of the aligned contour of each detected object in the image; these files are used by the 'measure' command to obtain morphometric traits and standardised shape values; (2) 'extractions' contains cropped images of aligned objects; (3) 'masks' contains binary masks of the aligned objects, which are used to extract colour measurements.
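The Procrustes step described in this section can be illustrated with a short sketch. It assumes two contours with the same number of points in corresponding order, which sidesteps the elliptic Fourier approximation that Traitor applies first; the function name is hypothetical and the code is not taken from Traitor.

```python
import numpy as np


def procrustes_align(reference, contour):
    """Translate, scale and rotate `contour` to best match `reference`.

    Both inputs are (n_points, 2) arrays with corresponding points (assumption).
    Returns the aligned contour and the centred, unit-size reference.
    """
    ref = np.asarray(reference, dtype=float)
    pts = np.asarray(contour, dtype=float)
    # Remove translation and scale.
    ref_c = ref - ref.mean(axis=0)
    pts_c = pts - pts.mean(axis=0)
    ref_c = ref_c / np.linalg.norm(ref_c)
    pts_c = pts_c / np.linalg.norm(pts_c)
    # Optimal rotation from the SVD of the cross-covariance matrix.
    u, _, vt = np.linalg.svd(pts_c.T @ ref_c)
    sign = np.sign(np.linalg.det(u @ vt))  # avoid reflections
    rotation = u @ np.diag([1.0, sign]) @ vt
    return pts_c @ rotation, ref_c


# Example: an ellipse that was rotated by 40 degrees, enlarged and shifted.
theta = np.linspace(0.0, 2.0 * np.pi, 60, endpoint=False)
reference = np.column_stack([2.0 * np.cos(theta), np.sin(theta)])
angle = np.deg2rad(40.0)
rot = np.array([[np.cos(angle), -np.sin(angle)],
                [np.sin(angle), np.cos(angle)]])
contour = 2.5 * reference @ rot.T + np.array([10.0, -4.0])

aligned, reference_unit = procrustes_align(reference, contour)
residual = float(np.abs(aligned - reference_unit).max())
```

After alignment the residual is close to zero because the second contour differs from the first only by translation, scale and rotation; real seed contours would leave a non-zero residual that reflects genuine shape differences.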

| Measure
The 'measure' command uses the output of the 'align' command to obtain several measurements relevant for seed characterisation (Table 1; Appendix S1). The aligned contours are used to calculate traditional morphometric descriptors. Contours are also used to extract standardised shape coordinates, which can then be used to build objective classification categories with high levels of visual interpretability (e.g. Victorino & Gómez, 2019). Cropped images and masks are used to obtain two types of colorimetric measurements: sRGB values, which are suitable for human recognition purposes; and linearised sRGB values, which are correlated with reflectance values in the three broad-band parts of the spectrum (longwave, mediumwave, shortwave) and are thus useful for studies independent of any particular animal visual system (Stevens et al., 2007). Seed count is not directly measured but can be easily obtained by counting rows in the output.
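As an illustration of how descriptors can be derived from an aligned contour, the sketch below computes a few commonly used quantities from a list of xy coordinates. The descriptor names, their exact definitions and the `mm_per_pixel` parameter are assumptions chosen for this example; they are not Traitor's output columns (see Table 1 and Appendix S1 for those).

```python
import math


def contour_descriptors(points, mm_per_pixel=1.0):
    """Basic size and shape descriptors of a closed, upright contour.

    `points` is an ordered list of (x, y) vertices in pixels. Length is taken
    along y and width along x, assuming the contour is already aligned upright.
    """
    n = len(points)
    twice_area = 0.0
    perimeter = 0.0
    for i in range(n):
        x0, y0 = points[i]
        x1, y1 = points[(i + 1) % n]
        twice_area += x0 * y1 - x1 * y0  # shoelace formula
        perimeter += math.hypot(x1 - x0, y1 - y0)
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    length = (max(ys) - min(ys)) * mm_per_pixel
    width = (max(xs) - min(xs)) * mm_per_pixel
    area = abs(twice_area) / 2.0 * mm_per_pixel ** 2
    perimeter *= mm_per_pixel
    return {
        "length": length,
        "width": width,
        "area": area,
        "perimeter": perimeter,
        "aspect_ratio": width / length,
        "circularity": 4.0 * math.pi * area / perimeter ** 2,
    }


# Example: a 2 x 6 pixel rectangle scanned at 0.5 mm per pixel.
rectangle = [(0, 0), (2, 0), (2, 6), (0, 6)]
descriptors = contour_descriptors(rectangle, mm_per_pixel=0.5)
```

For a scanner, the scale follows from the resolution as 25.4 mm divided by the number of dots per inch, which is why the acquisition protocol requires the scale to be known.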

| Installation and usage
Traitor is implemented in Python (compatible with versions 3.7 to 3.10) and should be installed and run from a terminal (macOS and Linux) or Command Prompt/PowerShell (Windows). The software and all its dependencies can be installed using the pip package installer. All commands provide a help option to list available parameters. Traitor source code, documentation and tutorial are available on GitHub (https://github.com/TankredO/traitor).

| VALIDATION AND CASE STUDIES
We validate the method and demonstrate the application of the extracted traits using images from DiasMorph, a dataset containing images of Central European seeds and diaspores (Dayrell, Begemann, et al., 2023). Images were acquired, processed and used for trait extraction following the protocol outlined in Section 2. The dataset contains seeds of a wide range of taxa with lengths varying from 0.33 to 30.2 mm, in a variety of shapes, colours, and appendages.
All analyses using the extracted seed traits were performed in R (R Development Core Team, 2022) and are available on GitHub (https://github.com/rdayrell/Traitor_analyses).

| Validation of size measurements
To validate the results, we used images of 91,667 seeds from 1228 taxa (≥5 seeds per taxon; 89 plant families) from the DiasMorph dataset. We compared the average length and width values extracted by Traitor with average manual measurements obtained from seeds of the same collection (Supporting Information); a one-to-one correspondence between the seeds could not be established. For manual measurements, five seeds from each taxon were measured with the aid of a stereo microscope (Stemi SV 11; Carl Zeiss Jena GmbH, Germany), following a standardised protocol (Kleyer et al., 2008).
The agreement between the measurements obtained by the two methods was assessed by calculating Lin's concordance correlation coefficient (ρc), an index of how well a new measurement reproduces a standard measurement; it ranges from −1 (perfect discordance) to 1 (perfect concordance) (Lin, 1989). For this, we used the 'CCC' function implemented in the 'DescTools' package (Signorell et al., 2022).
The ρc for length and width measurements were 0.979 (95% CI respectively. This shows high agreement between Traitor's measurements and the reference values. We manually checked the outlines and alignment outputs for the taxa with less than 95% agreement between the two methods and detected no issues with Traitor's extract and align outputs. As the two sets of measurements were not obtained from exactly the same seeds, the larger differences detected in a few cases can be attributed to measurements being carried out on a less representative sample of seeds (with a size bias) or on different structures (e.g. seeds rather than diaspores).
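For readers who want to see what the concordance correlation coefficient measures, a minimal sketch of Lin's (1989) formula is given below, ρc = 2·s_xy / (s_x² + s_y² + (mean_x − mean_y)²), using moments divided by n. The analysis in this study used the 'CCC' function of the R package 'DescTools'; the Python function here is a hypothetical stand-in, does not compute confidence intervals, and may differ from that implementation in details.

```python
import numpy as np


def concordance_correlation(x, y):
    """Lin's concordance correlation coefficient between two paired samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mean_x, mean_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()  # variances with divisor n
    cov_xy = ((x - mean_x) * (y - mean_y)).mean()
    return 2.0 * cov_xy / (var_x + var_y + (mean_x - mean_y) ** 2)


# Made-up paired measurements (e.g. manual vs automated lengths in mm).
manual = [1.0, 2.0, 3.0, 4.0]
ccc_identical = concordance_correlation(manual, manual)
ccc_shifted = concordance_correlation(manual, [m + 1.0 for m in manual])
ccc_reversed = concordance_correlation([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

The shifted case shows why the coefficient is stricter than Pearson's correlation: a constant offset of 1 mm leaves the correlation at 1 but lowers the concordance, because the points no longer fall on the one-to-one line.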

| Colour measurements for human recognition
We used Traitor's output for taxa of the Rosaceae family within the DiasMorph dataset to obtain representative colours for taxa, which can assist in tasks related to colour description for human recognition (see Appendix S2 for a detailed method description). Median sRGB values were used to describe colour variation among taxa: the channel-wise median of seeds' median colours was calculated for each taxon; seeds' median colours were used in principal component analysis (PCA) for dimensionality reduction, and the colours corresponding to the lowest and the highest PC1 values were considered as the minimum and maximum colour values for the taxon (Figure 4). The channel-wise medians of seeds' dominant colours were used to describe colour variation in seeds of each taxon (Figure 4; Appendix S3).
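The procedure for one taxon can be sketched as follows, assuming an array with one median sRGB colour per seed. The case study itself was carried out in R (see Appendix S2); the function name and the colour values below are hypothetical and serve only to illustrate the channel-wise median and the PC1 extremes.

```python
import numpy as np


def representative_colours(seed_colours):
    """Median colour plus the two colours at the extremes of PC1.

    `seed_colours` is an (n_seeds, 3) array of sRGB values, one row per seed.
    The sign of PC1 is arbitrary, so which extreme is called "low" or "high"
    carries no meaning by itself.
    """
    colours = np.asarray(seed_colours, dtype=float)
    median_colour = np.median(colours, axis=0)  # channel-wise median
    centred = colours - colours.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    pc1_scores = centred @ vt[0]
    return median_colour, colours[pc1_scores.argmin()], colours[pc1_scores.argmax()]


# Made-up median colours of five seeds, ranging from dark to light brown.
seeds = [[60, 40, 20], [90, 62, 35], [120, 85, 50], [150, 105, 62], [180, 128, 80]]
median_colour, low_colour, high_colour = representative_colours(seeds)
extremes = sorted([low_colour.tolist(), high_colour.tolist()])
```

Taking the extremes along PC1 rather than per channel keeps each reported colour equal to a colour actually observed on a seed.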

| Colour measurements for ecological and evolutionary investigations
Linearised sRGB values exhibit a linear response to changes in light intensity and are a convenient option to obtain colour measurements for ecological and evolutionary studies (Stevens et al., 2007).
In this case study, we sampled colour traits of taxa in the Asteraceae family within the DiasMorph dataset and assessed whether seed colour is correlated with phylogenetic relatedness (see Appendix S2 for a detailed method description). Briefly, we calculated the median value of linearised sRGB for each taxon and subsequently performed a PCA for dimensionality reduction. Pagel's λ was used to estimate phylogenetic signal in PC1 scores, which explained 93.6% of the variation. λ can vary from 0 (no correlation between species) to 1 (species' traits are distributed as expected under Brownian motion; Pagel, 1999). The λ for PC1 was 0.84, and the likelihood ratio test rejected the hypothesis of lack of correlation among species (p < 0.0001). The result is consistent with some taxonomic groups having a greater propensity towards darker colours, and with multiple evolutionary transitions of colour traits (Figure 5).

TABLE 1 Measurements extracted by Traitor (see Appendix S1 for details). Columns: Traitor output; Description.
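The relationship between sRGB and linearised sRGB values can be made concrete with the standard sRGB transfer function (IEC 61966-2-1). Whether Traitor uses exactly this formula internally is an assumption; the sketch below, with hypothetical function names and a made-up colour, only illustrates what linearisation does to 8-bit channel values.

```python
def srgb_to_linear(value_8bit):
    """Convert one 8-bit sRGB channel value (0-255) to linear RGB (0-1)."""
    c = value_8bit / 255.0
    if c <= 0.04045:
        return c / 12.92
    return ((c + 0.055) / 1.055) ** 2.4


def linearise_colour(srgb_colour):
    """Apply the transfer function to each channel of an (R, G, B) tuple."""
    return tuple(srgb_to_linear(channel) for channel in srgb_colour)


# Example: mid-grey in sRGB reflects far less than half of the light.
linear_grey = srgb_to_linear(128)
linear_brown = linearise_colour((120, 85, 50))  # made-up median seed colour
```

Because sRGB encoding is non-linear, a channel value of 128 corresponds to roughly 22% of the maximum linear intensity, which is why linearised values are the more appropriate input when colour is treated as a proxy for reflectance.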

| Objective categorisation of seed shapes
We used records of Carex species within the DiasMorph dataset to demonstrate how the shape outline standardised for size invariance (Traitor's output) can be used for the construction of objective and interpretable categories. Briefly, outlines were aligned and represented by an elliptic Fourier transform as quantitative variables (harmonics), which were then used in PCA for dimensionality reduction. Hierarchical cluster analysis was then used to determine the shape categories (see Appendix S2 for a detailed method description). We divided shapes into eight groups to exemplify how this approach can improve communication of subtle differences.
The shape closest to the centre of each cluster is the representative shape of the cluster (Figure 6b; see Appendix S4 for outline of all seeds within each cluster). We overlaid 95% data ellipses for seeds of six species to visualise differences in intraspecific variability in seed shape.
PC1 explained 70.7% of the variance of the set of harmonics, separating seeds by aspect ratio; PC2 explained 14.3% of the variance, separating seeds that are relatively symmetrical along their width from seeds that widen below the centre while tapering above the centre (Figure 6a,b). Intraspecific variability can differ greatly among species, as evidenced by the contrasting data dispersion of species: for example, Carex paucifolia seeds were concentrated in only one cluster, while Carex atrata subsp. aterrima seeds were found in five different clusters (Figure 6c).
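The first step of this analysis, turning an outline into harmonics, can be sketched with the classical elliptic Fourier formulas for a closed polygonal contour. The case study was run in R (see Appendix S2), so this Python function is a hypothetical illustration; it assumes an ordered contour without repeated consecutive points and omits the normalisation, PCA and clustering steps.

```python
import numpy as np


def elliptic_fourier_coefficients(contour, n_harmonics=10):
    """Elliptic Fourier coefficients (a_n, b_n, c_n, d_n) of a closed contour.

    `contour` is an (n_points, 2) array of ordered xy coordinates. Returns an
    array of shape (n_harmonics, 4), one row per harmonic.
    """
    pts = np.asarray(contour, dtype=float)
    closed = np.vstack([pts, pts[:1]])           # close the polygon
    dxy = np.diff(closed, axis=0)                # segment vectors
    dt = np.hypot(dxy[:, 0], dxy[:, 1])          # segment lengths
    t = np.concatenate([[0.0], np.cumsum(dt)])   # cumulative arc length
    total = t[-1]
    coeffs = np.zeros((n_harmonics, 4))
    for n in range(1, n_harmonics + 1):
        const = total / (2.0 * n ** 2 * np.pi ** 2)
        phi = 2.0 * n * np.pi * t / total
        d_cos = np.cos(phi[1:]) - np.cos(phi[:-1])
        d_sin = np.sin(phi[1:]) - np.sin(phi[:-1])
        coeffs[n - 1] = [
            const * np.sum(dxy[:, 0] / dt * d_cos),
            const * np.sum(dxy[:, 0] / dt * d_sin),
            const * np.sum(dxy[:, 1] / dt * d_cos),
            const * np.sum(dxy[:, 1] / dt * d_sin),
        ]
    return coeffs


# Example: a circle of radius 3 is described by its first harmonic alone.
theta = np.linspace(0.0, 2.0 * np.pi, 360, endpoint=False)
circle = np.column_stack([3.0 * np.cos(theta), 3.0 * np.sin(theta)])
coeffs = elliptic_fourier_coefficients(circle, n_harmonics=5)
```

Each outline becomes a fixed-length vector of coefficients regardless of how many points the contour has, which is what makes the harmonics suitable as input variables for PCA and cluster analysis.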

| DISCUSSION AND CONCLUDING REMARKS
The image acquisition and processing protocol and Traitor, the open-source software, presented in this study were designed to drastically speed up the extraction of seed morphological traits.
The method aims to support scientists and practitioners who work with a wide range of taxa to obtain objective and interpretable measurements in a time- and labour-saving manner. Traitor's main strengths are its high level of automation in the image segmentation process, and its ability to produce highly accurate and interpretable measurements regardless of the morphological attributes of the measured object. Additionally, Traitor allows users to easily inspect the quality of image segmentation outputs. Therefore, our approach provides a more convenient and automated alternative to extract interpretable traits from seeds with diverse morphologies [...] Python (Gehan et al., 2017) and deep-learning based models (Ott & Lautenschlager, 2022) [...] the interpretability of the extracted data can also be compromised.
Users should carefully evaluate the method's suitability for other applications and set-ups.
Traitor's output provides means for constructing category systems for seed shape and colour with high levels of visual interpretability, offering opportunities to improve the communication of seed morphology and to advance new identification tools. The approach can open new avenues in the field of seed ecology and evolution and increase the availability of seed morphology data that are comparable across taxa.

AUTHOR CONTRIBUTIONS
Roberta Dayrell designed the study, conducted case studies, and wrote the manuscript. Tankred Ott programmed the software. Tom Horrocks programmed the colour calibration and advised on the use of colour spaces. Peter Poschlod provided his seed herbarium, conceived the project and obtained the funding. All authors contributed critically to the drafts.

ACKNOWLEDGEMENTS
This research was funded by the European Regional Development

CONFLICT OF INTEREST STATEMENT
The authors declare no conflict.