Insect morphometry is reproducible under average investigation standards

Abstract Morphometric research is being applied to a growing number and variety of organisms. Discoveries achieved via morphometric approaches are often considered highly transferable, in contrast to the tacit and idiosyncratic interpretation of discrete character states. However, the reliability of morphometric workflows in insect systematics has never been a subject of focused research, and such studies are sorely needed. In this paper, we assess the reproducibility of morphometric studies of ants, where the mode of data collection is a shared routine. We compared datasets generated by eleven independent gaugers, that is, collaborators who measured 21 continuous morphometric traits on the same pool of individuals according to the same protocol. The gaugers possessed a wide range of morphometric skills, had varying expertise among insect groups, and differed in their facility with measuring equipment. We used intraclass correlation coefficients (ICC) to calculate repeatability and reproducibility values (i.e., intra- and intergauger agreement), and we performed a permutational multivariate analysis of variance (PERMANOVA) using the Morisita index of dissimilarity with 9,999 iterations. The average intraclass correlation coefficients of the different gaugers ranged from R = 0.784 to R = 0.9897, and a significant correlation was found between repeatability and the morphometric skills of gaugers (p = 0.016). There was no significant association with the magnification of the equipment in the case of these rather small ants. The intergauger agreement, that is, the reproducibility, varied between R = 0.471 and R = 0.872 (mean R = 0.690), but all gaugers arrived at the same two-species conclusion. A PERMANOVA test revealed no significant gauger effect on species identity (R² = 0.69, p = 0.58).
Our findings show that morphometric studies are reproducible when observers follow the standard protocol; hence, morphometric findings are widely transferable and will remain a valuable data source for alpha taxonomy.


| INTRODUCTION
The phenotype of organisms varies continuously during development and through evolutionary time. Continuous morphological variation is captured for numerous purposes in the life sciences via the practice of morphometry: the measurement of the size and shape of anatomical forms. Morphometry has yielded novel findings in evolution (Esquerré et al., 2020) and has been used to assess fluctuating asymmetry (Palmer, 1993; Klingenberg, 2015), ontogeny (Csősz & Majoros, 2009; Shingleton et al., 2007), and ecomorphism (Anderson et al., 2019; Mahendiran et al., 2018; Tomiya & Meachen, 2018), and it is applied in human clinical practice (Bartlett & Frost, 2008). Among other applications, morphometric data are also key for alpha taxonomy, the discipline of formally differentiating and describing species and higher taxa. This is exemplified by the development of phenetics in the twentieth century (Michener & Sokal, 1957; Sokal & Sneath, 1963) and by numerous modern studies in other frameworks, such as for plants (Chuanromanee et al., 2019; Savriama, 2018), animals (Inäbnit et al., 2019; Villemant et al., 2007), and other organisms (Fodor et al., 2015; McMullin et al., 2018). Continuous data are also valuable for modeling evolutionary histories (e.g., Parins-Fukuchi, 2017, 2020). Thus, the morphometric approach constitutes a fundamental and crucial practice for the study of phenotypes in biodiversity research. Morphology is traditionally considered to comprise both continuous and discrete traits (Aristotle, 350; Remane, 1952; Rensch, 1947; Thompson, 1917). Discrete states were established as the basic comparative units in animal alpha taxonomy from its formalization (Linnaeus, 1758) and have become a key means of scoring data for phylogenetic analysis, particularly after Hennig (1950, 1966).
The reproducibility of scoring discrete states is an issue, however, as qualitative perception of phenotype not only requires specific training and considerable experience but can also be plagued by arbitrariness (Bond & Beamer, 2006), meaning that variation may simply come from individual (mis)interpretation. The qualitative approach commonly relies on verbal species descriptions that are often subjective or difficult to articulate. Therefore, information transfer, if at all reliable, is based on one-to-one knowledge-sharing mechanisms and requires logically structured linguistic hierarchies such as the Hymenoptera Anatomy Ontology (Yoder et al., 2010).
In contrast to this relatively idiosyncratic approach, morphometry is considered transferable. It converts variation in the shape and size of anatomical traits, and in the number and arrangement of anatomical elements, into numerical values, allowing for the dissemination of reproducible, phenotype-based knowledge. Today, an increasing number of morphology-based insect alpha-taxonomists use morphometric data and provide numeric keys to species (e.g., Csősz et al., 2015; Seifert, 2018). If observers arrive at the same conclusion by measuring traits according to the same protocol, findings are believed to be reliable and transferable. If one can measure a trait, anyone else should be able to reproduce the measurement.
All measurements are subject to error, however. Agreement among different observers, and within a single observer's measurements, is affected by a number of sources, such as the skills of the observer (if human input is required), the precision and accuracy of the equipment, clear interpretation and appropriate understanding of the character-recording protocol, and other parameters. All of the uncertainty factors mentioned above are common in practice, and the fact that it is impossible to control every source of measurement variation challenges morphometry-based research (Wolak et al., 2012). An understanding of the degree to which measurement errors may affect the transferability of findings is urgently needed.
In order to address the question "to what extent is insect morphometry reproducible?," we compiled a broad database of morphometric data and performed robust statistical analyses. We used ants, a group in which the application of morphometric data has a long tradition (e.g., Brian & Brian, 1949; Brown, 1943). Morphometry has been employed widely in recent myrmecological studies (e.g., Ward, 1999; Baroni Urbani, 1998; Seifert, 1992, 2003; Csősz et al., 2015; Wagner et al., 2017) as the primary method of interpreting anatomical forms and their variation. To evaluate reproducibility, we asked eleven participants to perform repeated measurements on the same set of ant specimens, using the same protocol and their own equipment. These participants, or gaugers, came from three continents and six countries, were of diverse levels of skill and expertise, and worked with different taxonomic routines. The wide range of morphometric skills and the quality of the microscopes used provided us with an overview of the level of reproducibility of morphometric interpretation as it works in daily practice. Our findings are a first step in exploring the reproducibility of morphometric data across entomology.

| The research objects
As an ideal stress-test basis for evaluating repeatability of morphometric studies in insect systematic research, we selected ten specimens each of a cryptic species pair, Nesomyrmex devius (Csősz & Fisher, 2016) and Nesomyrmex hirtellus (Csősz & Fisher, 2016), for a total of twenty ant specimens. Every trait under observation shows overlapping ranges (Seifert, 2009); thus, these species can be classified in multivariate fashion only. Today, cryptic species pairs are considered the most difficult cases and pose extraordinary challenges to systematic biology. Authorization for export was provided by the Director of Natural Resources.

| Gaugers
We addressed the question of whether or not the morphometric measurements performed by eleven gaugers ("measurers") could be considered repeatable based on statistical thresholds.

Box 1 Terminology
A number of terms (e.g., "accuracy," "precision," "reliability," "repeatability," and "reproducibility") commonly used in association with repeatability studies are defined differently in the literature. To increase the fluency of scientific discourse, we propose to adopt the standard terminology of the National Institute of Standards and Technology (NIST; Taylor & Kuyatt, 2001) of the United States and the terms proposed by Bartlett and Frost (2008) for biological systematics: • Accuracy describes the average closeness of the measurement(s) to the value of the measurand (= the subject or quantity to be measured) (Figure 1). Accuracy is affected by systematic and random error. We follow the terminology proposed by the NIST in using the phrase "the value of the measurand" instead of the often-applied "true value of the measurand" (or "a true value") (Taylor & Kuyatt, 2001).
• Precision refers to the closeness of agreement between repeated measurements made on the same measurand using the same protocol. Precise measurements are tightly clustered but are not necessarily accurate, that is, close to the value of the measurand (Figure 1). Precision is affected by random error.
• Reliability refers to the amount of measurement error that occurs between observed measurements compared to the inherent amount of variability that occurs between measurands (Bartlett & Frost, 2008).
• Repeatability refers to the degree of agreement between repeat measurements made on the same measurand under the same conditions, that is, made by the same observer, using the same microscope, following the same measurement protocol (Taylor & Kuyatt, 2001). Repeatability can be assessed via intraclass correlation (ICC, see Lessells & Boag, 1987).
• Reproducibility refers to the degree of agreement between measurements made on the same measurand under changing conditions, such as changing principle, method of measurement, observer, instrument, etc. (Taylor & Kuyatt, 2001).
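The distinction between accuracy and precision drawn above can be illustrated with a minimal numeric sketch (not part of the original study; the readings and the measurand value below are invented for illustration):

```python
import statistics

def accuracy_and_precision(measurements, measurand_value):
    """Summarize a series of repeat measurements.

    Accuracy is reported as the bias (mean minus the value of the
    measurand); precision as the sample standard deviation of the repeats.
    """
    mean = statistics.mean(measurements)
    bias = mean - measurand_value
    spread = statistics.stdev(measurements)
    return bias, spread

# Hypothetical repeat readings (mm) of a trait whose measurand value is 1.00
precise_but_biased = [1.10, 1.11, 1.09, 1.10]   # tight cluster, off target
accurate_but_noisy = [0.90, 1.10, 0.95, 1.05]   # centered, widely scattered

bias1, sd1 = accuracy_and_precision(precise_but_biased, 1.00)
bias2, sd2 = accuracy_and_precision(accurate_but_noisy, 1.00)
```

The first series is precise but inaccurate (small spread, large bias); the second is accurate but imprecise (negligible bias, large spread).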
The eleven gaugers are not identified by name in this paper, but in order to provide the most important information regarding their skills and their equipment's quality, gaugers are coded in triad format as follows: field of expertise, estimated total number of specimens measured in their career, and the maximum magnification of the microscope used in the present study, separated by underscores (e.g., MYRM_9000_100x).

| The morphometric character recording protocol
Gaugers were asked to measure 21 continuous morphometric characters on each specimen twice, in order to collect data for testing both the intragauger error rate, equivalent to repeatability, and the intergauger error rate, equivalent to reproducibility. Every gauger was provided the same measurement protocol, including visual and verbatim trait definitions (Figure 2 and Table 1). The protocol was assembled from an existing set of characters used in published papers (Seifert, 2006, 2018; Csősz & Fisher, 2016; Schlick-Steiner et al., 2006; Wagner et al., 2017). In the current work, we addressed the question of the extent to which random and systematic errors affect the rate of reproducibility. Therefore, all gaugers were encouraged to eliminate extraordinary differences due to gross error (occurring due to misreading, mistyping, or an erroneously set magnification) by comparing the values of the repeated observations.
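The gross-error check described above, comparing the two repeats of each reading, can be sketched as follows. This is an illustrative implementation, not the authors' procedure; the 15% tolerance and the sample values are invented:

```python
def flag_gross_errors(first_pass, second_pass, tolerance=0.15):
    """Flag readings whose two repeats disagree by more than `tolerance`
    (as a fraction of their mean), a pattern typical of gross errors such
    as misreading, mistyping, or a wrongly set magnification.
    """
    flagged = []
    for i, (a, b) in enumerate(zip(first_pass, second_pass)):
        mean = (a + b) / 2
        if mean > 0 and abs(a - b) / mean > tolerance:
            flagged.append(i)
    return flagged

# Hypothetical paired repeats of one trait (mm) across five specimens;
# specimen 3 contains a typo-like outlier (0.88 recorded instead of 0.38)
run1 = [0.41, 0.39, 0.40, 0.88, 0.42]
run2 = [0.40, 0.40, 0.41, 0.38, 0.41]
print(flag_gross_errors(run1, run2))  # → [3]
```

Flagged readings would then be re-measured rather than averaged away, since gross errors affect both precision and accuracy.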

| Data analysis
Distribution patterns of objects (i.e., specimens represented by 21 characters measured by the eleven different gaugers) were displayed in a scatterplot via principal component analysis (PCA; Venables & Ripley, 2002), using standardization to zero mean and unit variance (Legendre & Gallagher, 2001). A permutational multivariate analysis of variance (PERMANOVA) was performed using the Morisita index of dissimilarity with 9,999 iterations (Anderson, 2001). The applied standardization reduces the influence of differences in scale among characters on the PCA (Borcard et al., 2011).

FIGURE 1 Precision versus accuracy. The bullseye represents the value of the measurand. Accuracy is indicated by closeness to the bullseye: measurements closer to the bullseye are more accurate. Precise measurements are tightly clustered. Accurate and precise measurements are tightly clustered in the bullseye. Graphics produced and used with permission from Dr. Bethan Davies (antarcticglaciers.org).

Box 2 Sources of errors
Recognized sources of error in morphometry include three broad classes of observational errors:
1. Random errors, which occur irregularly and hence are unpredictable. Such errors arise in three ways: random oscillations of the apparatus, mechanical vibrations, and minor positional changes of the subject at each measurement. This type of error results in dissimilar outcomes, which can be detected by replicated observations. Random error primarily affects precision.
2. Systematic errors, which can be subdivided into (a) observational error, which arises from an individual's bias, unclear description of measuring procedures, improper setting of the equipment, or false data recording due to parallax errors (Seifert, 2002); (b) instrumental error, caused by factors such as imperfect calibration; and (c) environmental error, which can be ascribed to the effects of external conditions on the measurements, for example, temperature or illumination. Systematic errors primarily influence a measurement's accuracy, but these sources are predictable.
3. Gross errors, arising from false readings, mistakes in recording data by an observer (e.g., reading or recording 88 instead of 38), or a mistakenly set magnification. This type of error seriously affects both precision and accuracy. This source of error can be eliminated by careful reading and recording, and it can also be recognized post hoc by comparing the repeated measurements in a pairwise matrix scatterplot (Baur & Leuenberger, 2011).
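The zero-mean, unit-variance standardization applied before the PCA can be sketched as follows. This is a minimal illustration in Python (the analyses themselves were run in R); the trait matrix is invented:

```python
import statistics

def standardize_columns(matrix):
    """Scale each character (column) to zero mean and unit variance so
    that traits measured on different scales contribute comparably to a PCA."""
    cols = list(zip(*matrix))
    means = [statistics.mean(c) for c in cols]
    sds = [statistics.stdev(c) for c in cols]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)]
            for row in matrix]

# Hypothetical matrix: rows = specimens, columns = two traits (mm)
data = [[1.20, 0.40], [1.30, 0.45], [1.10, 0.35], [1.40, 0.50]]
z = standardize_columns(data)
col0 = [row[0] for row in z]  # first standardized character
```

After standardization each column has mean 0 and standard deviation 1, so no single character dominates the ordination merely because of its measurement scale.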
Reliability depends on the magnitude of the error in the measurements relative to the inherent variability between subjects. Both sources of variability can be expressed as standard deviations (SDs).
Reliability is defined as the between-subject variance divided by the sum of the between-subject variance and the measurement-error variance. It is formally described by Bartlett and Frost (2008) as: reliability = (SD of subjects' true values)² / [(SD of subjects' true values)² + (SD of measurement error)²].
This measure of reliability is also known as intraclass correlation (ICC). If reliability is high, measurement error is small in comparison to the true differences between subjects, so that subjects can be relatively well distinguished (in terms of the quantity being measured) on the basis of the error-prone measurements (Bartlett & Frost, 2008).
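As a worked example of this ratio (illustrative numbers, not from the study):

```python
def reliability(sd_true, sd_error):
    """Reliability (ICC) as between-subject variance over total variance,
    following the definition in Bartlett and Frost (2008)."""
    return sd_true ** 2 / (sd_true ** 2 + sd_error ** 2)

# If the true between-specimen SD is twice the measurement-error SD:
print(reliability(2.0, 1.0))  # → 0.8
```

With measurement error half the size of the true between-subject variation, 80% of the observed variance reflects real differences between specimens.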
To estimate the within-subject SD, we applied a one-way analysis of variance (ANOVA) to the data containing the repeat measurements made on subjects. In addition, we also tested the effect of the gaugers' expertise and their equipment's performance on the accuracy of the ICC estimation using Spearman's rank correlation. The analyses were carried out in R 3.6.2 (R Core Team).

TABLE 1 Verbatim trait definitions for morphometric character recording
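The ANOVA-based ICC estimate can be sketched as follows. This is a minimal Python illustration of the one-way, balanced-design formula of Lessells and Boag (1987), not the authors' R code; the paired repeats are invented:

```python
import statistics

def icc_one_way(groups):
    """ICC from a one-way ANOVA on repeat measurements.

    `groups` is a list of per-specimen repeat lists, all of equal length.
    Following Lessells & Boag (1987) for a balanced design:
    ICC = (MSa - MSw) / (MSa + (n0 - 1) * MSw),
    where n0 is the number of repeats per specimen.
    """
    k = len(groups)            # number of specimens
    n0 = len(groups[0])        # repeats per specimen (balanced design)
    grand = statistics.mean(x for g in groups for x in g)
    ms_among = n0 * sum((statistics.mean(g) - grand) ** 2
                        for g in groups) / (k - 1)
    ms_within = sum((x - statistics.mean(g)) ** 2
                    for g in groups for x in g) / (k * (n0 - 1))
    return (ms_among - ms_within) / (ms_among + (n0 - 1) * ms_within)

# Hypothetical paired repeats (mm) for four specimens
repeats = [[0.40, 0.41], [0.45, 0.44], [0.50, 0.52], [0.60, 0.59]]
print(round(icc_one_way(repeats), 3))  # → 0.987
```

Here the repeats agree closely relative to the spread among specimens, so the ICC is near 1; larger within-specimen disagreement would pull it toward 0.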

| Agreement in classification between gaugers
The classification of the 18 pairs of independent observations made by eleven gaugers was successful for the two taxa according

| Reproducibility (intergauger agreement)
The intraclass correlation coefficients (ICCs) indicated that the reproducibility of the 21 examined morphometric characters varied between R = 0.471 and R = 0.872 (mean R = 0.690) when intergauger agreement was considered across the 11 gaugers.

| Repeatability (intragauger agreement)
A geometric mean of intraclass correlation coefficients was calculated for every gauger in order to evaluate their personal performance in association with their skills and equipment quality. The calculated average intraclass correlation coefficients of the different gaugers ranged from R = 0.7840 to R = 0.9897 (Table 3). These results indicate that both observer experience and better optical resolution in microscopes reduce measurement error and increase repeatability.
In analyzing mean intragauger agreement character-wise, the mean ICC scores were highest under the conditions given for gaugers MYRM_60000_360x and MYRM_5000_288x.
If mean trait size does not contribute much to the rather low ICC scores in the present study, these data are probably better explained by a combination of the ten error sources specified for stereomicroscopy by Seifert (2002). It is impossible to determine which of these caused major disturbances in this study. All observers received verbal and picture-assisted character definitions (see Figure 2 and Table 1) but were given no further advice or protocols on how to minimize stereomicroscopic measuring errors. First, it is unknown whether all observers avoided parallax error. Second, it is unknown whether all observers used an X-Y-Z stage for spatial positioning of specimens (see Figure 1 in Seifert, 2002) and what positional stability this stage had. In spatial positioning, it is important to place the two endpoints of a measurement in the same visual plane, a placement that becomes more accurate the shallower the depth of focus or the higher the magnification of the optical system. Third, the performance and reliability (e.g., ratchet-step error) of the zoom microscopes used by gaugers in this study are unknown. Fourth, it is unknown how the observers made their readings (by one tenth of a graduation mark, by entire graduation marks, by digital read-out systems, etc.).

CONFLICT OF INTEREST
The authors declare no competing financial or nonfinancial interests.

DATA AVAILABILITY STATEMENT
Data are available from the Dryad Digital Repository: https://doi.org/10.5061/dryad.q83bk3jfq