Are geometric morphometric analyses replicable? Evaluating landmark measurement error and its impact on extant and fossil Microtus classification

Abstract Geometric morphometric analyses are frequently employed to quantify biological shape and shape variation. Despite the popularity of this technique, quantification of measurement error in geometric morphometric datasets and its impact on statistical results is seldom assessed in the literature. Here, we evaluate error on 2D landmark coordinate configurations of the lower first molar of five North American Microtus (vole) species. We acquired data from the same specimens several times to quantify error from four data acquisition sources: specimen presentation, imaging devices, interobserver variation, and intraobserver variation. We then evaluated the impact of those errors on linear discriminant analysis‐based classifications of the five species using recent specimens of known species affinity and fossil specimens of unknown species affinity. Results indicate that data acquisition error can be substantial, sometimes explaining >30% of the total variation among datasets. Comparisons of datasets digitized by different individuals exhibit the greatest discrepancies in landmark precision, and comparison of datasets photographed from different presentation angles yields the greatest discrepancies in species classification results. All error sources impact statistical classification to some extent. For example, no two landmark dataset replicates exhibit the same predicted group memberships of recent or fossil specimens. Our findings emphasize the need to mitigate error as much as possible during geometric morphometric data collection. Though the impact of measurement error on statistical fidelity is likely analysis‐specific, we recommend that all geometric morphometric studies standardize specimen imaging equipment, specimen presentations (if analyses are 2D), and landmark digitizers to reduce error and subsequent analytical misinterpretations.

Despite its analytical advantages and broad utility, replicating GM results can be challenging due to the variety of research equipment used to image the samples, variation in specimen positioning, and variation in landmark digitization within and among operators, all of which can generate data discrepancies. Each phase of GM data acquisition can introduce a unique form of random and/or systematic measurement error. When compounded, these errors may lead to inconsistency among repeated measures and obscure the distinction between biological and artificial variation among specimens (Fruciano, 2016; Robinson & Terhune, 2017). Three general types of GM-based measurement error are acknowledged (methodological, instrumental, and personal), which can be subdivided into more specific error sources (Arnqvist & Mårtensson, 1998). Here, we address four sources of measurement error encountered during landmark data acquisition:

| Imaging device; error type: Instrumental
Use of different instruments for projecting 3D objects on 2D and 3D surfaces (e.g., digitizing tablets, digital images, and scanners) can generate dissimilar morphological reconstructions of original specimens (Arnqvist & Mårtensson, 1998). Variation can occur within equipment types as well. Camera lenses, for example, generate 2D image distortion based on the magnification of an object and its distance and position from the camera; the extent of image distortion varies among lens types due to factors such as lens curvature (Zelditch et al., 2004). The resolution of an image will also vary depending on the number of photodetectors in a camera (Zelditch et al., 2004); some anatomical loci may be obscured in lower-resolution images which can impact the precision of landmark placement.
Error facilitated by dimensional loss is specific to 2D GM analyses; however, other forms of instrumental error can occur in 3D systems as well (Fruciano et al., 2017; Robinson & Terhune, 2017). The configuration of landmarks placed on specimen images may, therefore, be inconsistent when different equipment is used and/or when data from different imaging protocols are combined.

| Specimen presentation; error type: Methodological
Operators digitizing specimens in two dimensions should be cautious of their presentations (i.e., the projected orientation of specimens) since some degree of distortion is usually unavoidable when projecting 3D objects. Differential shifting of three-dimensional features can be problematic in 2D systems because z-axes are not retained and, therefore, projected locations of landmark loci can be displaced relative to their true position among other loci (Buser, Sidlauskas, & Summers, 2018; Cardini, 2014; Zelditch et al., 2004). Effects of such displacement can be exacerbated if landmark loci shift toward the edges of a camera field where image distortion is greatest (Fruciano, 2016; Zelditch et al., 2004). If all objects are projected from similar orientations and with the same equipment, any projection distortions should be similar among specimens and are thus unlikely to generate substantial artificial variation. If presentations are dissimilar among species, however, associated interspecimen variation in landmark coordinates may appear biological in downstream analysis when it is in fact artificial. Presentation error may be particularly substantial in situations where interspecimen orientations are difficult to standardize (e.g., when comparing within-cranium teeth of recent specimens to isolated teeth of fossil specimens).

| Interobserver error; error type: Personal
After specimens have been selected, presented, and projected, error can occur during landmark digitization. For example, one individual may position a landmark differently than another individual, even when digitizing the same locus of the same specimen. Error among landmark digitizers is referred to as interobserver error.

| Intraobserver error; error type: Personal
Digitizing error occurs within observers as well. An individual may place a landmark on a locus differently from one specimen to another or from one digitizing session to another. This is referred to as intraobserver error. Intra- and interobserver error can be affected by factors such as variation in digitizing experience among observers, the number of digitizing sessions conducted per observer, and ease of landmark loci visualization (Fruciano, 2016; Osis, Hettinga, Macdonald, & Ferber, 2015). Observer errors may be exacerbated by variation in specimen projection and/or presentation as well.
Measurement error introduced at various phases of landmark data acquisition can be substantial; however, its collective impact on GM analyses is underreported (Fruciano, 2016). It is not uncommon for studies to report inter- and intraobserver error (e.g., Dujardin, Kaba, & Henry, 2010; Gonzalez, Bernal, & Perez, 2011; Nicholson & Harvati, 2006; Ross & Williams, 2008), possibly because landmark digitization does not require presentation and projection replications and because it can be conducted any time after projected specimen data collection. Personal digitizing errors are therefore more convenient to quantify than most other error types (Fruciano, 2016). Quantification of presentation and projection error requires replication that generally must be conducted at specimen housing facilities, and it is seldom assessed in the literature (but see Fruciano, 2016; Fruciano et al., 2017; Robinson & Terhune, 2017) even though its potential for obscuring biologically meaningful shape variation is considerable (Fruciano, 2016; Zelditch et al., 2004). Few studies demonstrate how several of these error types can combine to affect inferences drawn from statistical results (but see Fruciano et al., 2017, 2020; Robinson & Terhune, 2017; Vergara-Solana, García-Rodríguez, & Cruz-Agüero, 2014). This context is important because ecological, archeological, and paleontological studies often use statistical grouping analyses (e.g., linear discriminant analysis (LDA)/canonical variate analysis) to determine the taxonomic or ecological affinity of unknown specimens (Kovarovic, Aiello, Cardini, & Lockwood, 2011; Webster & Sheets, 2010). Despite the frequency of GM-based classification analyses in the literature (e.g., Baumgartner & Hoffman, 2019; Cassini, 2013; Curran, 2012; De Meulemeester et al., 2012; Gómez Cano et al., 2013; Wallace, 2006), the impacts of multiple sources of measurement error on statistically derived group membership predictions are largely untested.
Here, we evaluate the relative contribution of GM measurement error from different landmark data acquisition sources and their impact on LDA group membership predictions. We specifically quantify error introduced from four sources (specimen presentation, specimen imaging devices, interobserver digitization, and intraobserver digitization) and determine how the accuracy and replicability of 2D landmark-based identifications of five closely related extant species, and the predicted group membership replicability of congeneric specimens of unknown species affinity, are affected by each error source. For this study, we define "replicable" as achieving the same group membership predictions of individual specimens among repeated data acquisition iterations. We do this so future researchers classifying specimens via landmark analysis are aware of (a) the data acquisition sources that may introduce non-negligible amounts of measurement error and (b) the precautions that can be employed to mitigate those errors and their impacts on statistical results.

| Study system
Recent work on observer and method-based GM error suggests that error may have a substantial impact on statistical results when variation among similar (e.g., intraspecific) groups is analyzed because morphological differences among those groups are likely to be subtle (Robinson & Terhune, 2017). Thus, artificial variation introduced via GM error may be more likely to impact classification statistics, and bias subsequent inferences of biological variation, when among-group variation is low (Robinson & Terhune, 2017). Microtus (voles) are frequently studied in modern and ancient ecological contexts owing to their habitat specificity and ubiquity in modern and prehistoric biotic assemblages (Bell & Bever, 2006; Bell & Repenning, 1999; McGuire, 2011; Smartt, 1977; Wallace, 2019). However, identifying voles to species is challenging due to high morphological variability, high diversity, and sympatry of species throughout much of North America (Barnosky, 1990; Bell & Bever, 2006; Smartt, 1977; Wallace, 2006). Over the past two decades, more advanced research techniques, including landmark-based LDA of Microtus lower first molars (m1s), have improved vole species identification accuracy (Wallace, 2006), but identification is still imperfect when applied to study regions such as western North America due to marked geographic range and shape overlap among the many members that reside there today (McGuire, 2011). Western North American voles are therefore an appropriate system for evaluating GM measurement error on classification statistics when intergroup variation is low.

| Study design
We replicated 2D digital specimen images (n = 247) and m1 landmark configurations (n = 21 landmarks; Figure 1) of five Microtus species. Each species group thus meets ideal LDA conditions in that (a) predictor variables (i.e., x and y Cartesian landmark coordinates, n = 42) do not exceed the n of the smallest group and (b) group sample sizes are approximately equal (Kovarovic et al., 2011). Each phase of landmark data acquisition (i.e., specimen presentation, specimen imaging, and inter/intraobserver digitization) was repeated to quantify error from those sources. Our study design for quantifying error from each source was as follows:

| Imaging device
We assembled two datasets using specimen images obtained from two different cameras to evaluate interinstrument variation (hereafter "imaging device" or simply "device" variation). The first image set comprised the original specimen photographs taken with a Nikon camera (McGuire, 2011); the second comprised the same specimens rephotographed with a Dino-Lite digital camera.

| Specimen presentation
After an initial Dino-Lite photograph was taken, each Microtus specimen was tilted haphazardly along its anteroposterior and/or labiolingual axis and rephotographed with all landmark loci still visible. This was done to simulate specimen orientation changes that may occur when comparing dissimilar specimens such as in situ teeth and isolated teeth. That scenario is not uncommon when comparing fossil specimens to recent specimens since complete preservation of fossilized craniodental remains is rare. When loose m1s were available from recent Microtus specimens, those teeth were photographed in isolation rather than in situ during this iteration. We note, however, that intentionally tilting specimens potentially exacerbates presentation error relative to the amount of error typically introduced when specimen orientations are standardized. The intent of this modification is to quantify potential presentation error rather than expected error since presentation error will vary by study (Fruciano, 2016).

| Inter/intraobserver error
To quantify observer variation, the original Nikon Microtus m1 images and Dino-Lite resampled images were digitized by two observers using the 21-landmark protocol of Wallace (2006) and McGuire (2011) (Figure 1). Those observers also allowed us to evaluate methodological experience, a variable suggested (but rarely tested) to impact the magnitude of observer error (Fruciano, 2016). It is perhaps expected that experience will reduce digitizing error, but recent studies have shown that is not always true (e.g., Engelkes et al., 2019), thus warranting its quantification here. One observer, hereafter referred to as the experienced observer (EO), had previous experience conducting 2D landmark analyses at the time this study was initiated while the other observer, hereafter referred to as the new observer (NO), did not. Each image set was then digitized a second time by the EO and NO with at least 1 week between iterations to evaluate intraobserver variation on landmark placement.

| Data preparation
Nine unique landmark datasets were assembled in total to evaluate measurement error from the four focal data acquisition sources.
First, Nikon and Dino-Lite image sets were assembled to quantify imaging device variation. Those image sets were digitized twice by each observer to evaluate inter- and intraobserver error (two image sets and two digitizing iterations per observer = eight datasets, Figure A1). A "tilted" Dino-Lite image set was then assembled and digitized by the EO to quantify data variation due to changes in specimen presentation, resulting in a total of nine datasets. All image sets were assembled and digitized using TpsUtil 32 (Rohlf, 2018a) and TpsDig 2.32 (Rohlf, 2018b) software, respectively. Each landmark dataset was superimposed via GPA to standardize effects of rotation, orientation, and scale among specimens using the gpagen function in the R package "geomorph" (version 3.1.3, Adams, Collyer, & Kaliontzopoulou, 2019). During GPA, all specimens are translated to the origin, scaled to unit-centroid size, and optimally rotated via a generalized least-squares algorithm to align them along a common coordinate system (Rohlf & Slice, 1990).

FIGURE 1 Lower left first molar occlusal surface of MVZ-132727 Microtus californicus illustrating the 21-landmark configuration used to quantify shape variation among extant and fossil Microtus species. See Wallace (2006) for landmark definitions.
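The superimposition steps described above (translate to the origin, scale to unit centroid size, iteratively rotate to the mean shape) can be sketched compactly. The following is a minimal NumPy illustration of 2D generalized Procrustes alignment, not the geomorph gpagen implementation used in this study; reflection handling and convergence checks are omitted for brevity.

```python
import numpy as np

def gpa(configs, iters=5):
    """Minimal 2D generalized Procrustes superimposition (illustrative sketch).

    configs: array of shape (n_specimens, n_landmarks, 2).
    Each configuration is translated to the origin, scaled to unit
    centroid size, then iteratively rotated to the evolving mean shape
    via the least-squares (SVD-based) optimal rotation.
    """
    X = np.asarray(configs, dtype=float).copy()
    X -= X.mean(axis=1, keepdims=True)                   # translate to origin
    X /= np.linalg.norm(X, axis=(1, 2), keepdims=True)   # unit centroid size
    mean = X[0]
    for _ in range(iters):
        for i, cfg in enumerate(X):
            u, _, vt = np.linalg.svd(cfg.T @ mean)        # optimal rotation to mean
            X[i] = cfg @ (u @ vt)
        new_mean = X.mean(axis=0)
        mean = new_mean / np.linalg.norm(new_mean)        # keep mean at unit size
    return X
```

After alignment, any residual differences among configurations reflect shape (plus measurement error) rather than position, scale, or orientation.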

| Quantifying measurement error
We ran Procrustes ANOVA models using the procD.lm function in geomorph to analyze source-specific variation in the nine GPA-transformed landmark datasets. Analyses were conducted on 22 unique pairwise dataset comparisons (see dataset comparison names in Table 1 for specific comparisons) and cumulatively across eight datasets using the following nested hierarchical levels: species > individuals > imaging device > interobservers > intraobservers (Figure A1). Specimen presentation was only evaluated via pairwise comparison of tilted versus nontilted Dino-Lite datasets because tilted presentations were not included in the original Nikon-image study design of McGuire (2011). All Procrustes ANOVAs were conducted using a residual randomization procedure with 999 iterations. Dataset comparisons were grouped according to the specific data acquisition iterations they encompassed using a four-part naming system. For example, "Dinolite_NoTilt_EO_T1" indicates that (a) the images were photographed with a Dino-Lite camera, (b) the images were not tilted, (c) the images were digitized by the EO, and (d) it was the EO's first digitizing iteration. Thus, pairwise comparison of datasets "Dinolite_NoTilt_EO_T1" and "Dinolite_NoTilt_EO_T2" quantifies intraobserver variation between digitizing iterations one and two for the EO since all other source components are equivalent. In addition, we calculated repeatability among our datasets using pairwise Procrustes ANOVA mean squares based on the protocol and equations of Arnqvist and Mårtensson (1998) and Fruciano (2016).
Repeatability quantifies the variability of repeated measurements within the same samples, in this case the resampled Microtus datasets, relative to the variability among samples, in this case the biological variation among specimens, on a zero to one scale. Values closer to one indicate higher repeatability, and values closer to zero indicate lower repeatability (Arnqvist & Mårtensson, 1998; Fruciano, 2016).
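The ANOVA-based repeatability described above can be computed directly from the mean squares. The sketch below assumes the common case of an equal number of replicate measurements per specimen (Arnqvist and Mårtensson also give an adjusted replicate coefficient for unbalanced designs, which is omitted here); the function name is ours, not from any package.

```python
def repeatability(ms_among, ms_within, n_replicates):
    """Repeatability (intraclass correlation) from ANOVA mean squares,
    following the protocol of Arnqvist & Mårtensson (1998) / Fruciano (2016).

    Variance components:
        s2_within = MS_within
        s2_among  = (MS_among - MS_within) / m, with m replicates per specimen
    Repeatability:
        R = s2_among / (s2_among + s2_within), bounded on [0, 1]
    """
    s2_within = ms_within
    s2_among = (ms_among - ms_within) / n_replicates
    return s2_among / (s2_among + s2_within)
```

For example, with among-specimen and within-specimen mean squares of 10.0 and 1.0 over two replicate digitizations, repeatability is about 0.82, i.e., most variation is biological rather than measurement error.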

| Quantifying measurement error impacts on classification statistics
To determine how source-specific measurement error impacts Microtus species classification, we ran LDAs on each of the nine GPA-transformed landmark trial datasets using the lda function in the R package "MASS" (version 7.3, Venables & Ripley, 2002). Forty-two x, y coordinates from the 21 digitized landmarks were used as predictor variables to classify each specimen into a predicted species group. We used leave-one-out cross-validation to determine the percentage of specimens correctly classified within their respective species groups since it reduces standard LDA group overfitting (Kovarovic et al., 2011). Prior probabilities of group membership were assigned using the default lda argument, which bases priors on the proportions of each species group within the training sample.
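The same cross-validated classification workflow can be illustrated outside of R. The following scikit-learn sketch mirrors the procedure (LDA with group-proportional priors and leave-one-out cross-validation) on toy data; it is an illustrative equivalent, not the MASS::lda analysis run in this study, and the two-group toy dataset stands in for the five-species landmark data.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(0)
# Toy stand-in: two "species" of 20 specimens each, four coordinate variables
X = np.vstack([rng.normal(0, 1, (20, 4)), rng.normal(2, 1, (20, 4))])
y = np.array([0] * 20 + [1] * 20)

# Priors proportional to group sizes, mirroring MASS::lda's default behavior
lda = LinearDiscriminantAnalysis(priors=np.bincount(y) / len(y))

# Leave-one-out cross-validation: each specimen is classified by a model
# trained on all remaining specimens, reducing overfitting of the error rate
pred = cross_val_predict(lda, X, y, cv=LeaveOneOut())
accuracy = (pred == y).mean()
```

The cross-validated error rate (1 − accuracy) is the quantity compared among the nine replicate datasets in this study.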

| Species occurrence likelihood
Since LDA of western North American vole species is <100% accurate (McGuire, 2011), it may be difficult to determine whether some predicted group memberships (PGMs) are "real" or altered by error within the LDA training set, especially when the number of individuals assigned to a species group is small. Therefore, we consider a species occurrence "likely" if the percentage of unknown individuals classified to a species group, relative to the total number of individuals within the unknown dataset, exceeds the percentage of cross-validated classification error within the LDA training set. For example, a dataset that misclassifies 40 of the 247 recent Microtus specimens (16.2%) must assign more than 16.2% of the total unknown specimens to a particular species group for that species to be considered "likely present." Thus, if 15 of the 31 unknown specimens were assigned to M. californicus (48.4%), four were assigned to M. montanus (12.9%), and two were assigned to M. townsendii (6.5%), only M. californicus would be considered "likely present" since the percentage of specimens assigned to the other two species falls within the range of classification error for that dataset (see Results). To further vet false occurrences, we removed all fossil individuals with PGM posterior probabilities <.95 prior to these likelihood calculations.
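The worked example above reduces to a simple threshold comparison. The following sketch reproduces the arithmetic with a hypothetical helper function (the name is ours, not from the study's code):

```python
def likely_present(n_assigned, n_unknown, training_error_rate):
    """A species is deemed "likely present" only if the share of unknown
    specimens assigned to it exceeds the training set's cross-validated
    classification error rate; smaller shares are indistinguishable from
    misclassification noise."""
    return n_assigned / n_unknown > training_error_rate

# Training set: 40 of 247 recent specimens misclassified (~16.2% error)
training_error = 40 / 247

# 31 unknown fossils: 15 -> M. californicus, 4 -> M. montanus, 2 -> M. townsendii
flags = [likely_present(n, 31, training_error) for n in (15, 4, 2)]
# Only the first assignment (48.4% > 16.2%) clears the threshold
```

Note that this vetting is applied after removing fossils with PGM posterior probabilities below .95, so the counts entering the calculation are already probability-filtered.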

| Source-specific variation
Pairwise and nested Procrustes ANOVA of the landmark datasets show that all potential error sources generate significant error (Tables 1 and 2). Of the four sources, interobserver error is the most substantial and explains ~27% of the variation among datasets, on average, in pairwise comparisons (see R² values, Table 1). The next greatest sources of error in the pairwise comparisons are seen among specimen presentations and imaging devices, both of which explain ~20% of the variation on average, followed by intraobserver error, which explains ~12% of pairwise variation on average (Table 1). The combined model shows similar patterns for different contributions to variation (Table 2): Interobserver error accounts for ~21% of the variation, whereas device and intraobserver error explain ~8% and 10% of among-dataset variation, respectively, in the nested Procrustes ANOVA model. These data acquisition error sources together explain ~39% of the total variation among datasets, while biological variation among individuals and species explains ~53% and ~8% of the total variation, respectively (Table 2). Variables entered in stepwise discriminant analysis also vary among datasets, even among datasets collected with the same imaging device and digitized by the same observer (Table A2).

TABLE 1 Pairwise analysis of landmark datasets comparing Procrustes ANOVA residual R² percentages (ProcANOVA R² [%]) and repeatability among datasets, absolute differences among comparisons in cross-validated linear discriminant analysis predicted group membership error (PGM Error Change [%]), and differences among comparisons in the percent of predicted group membership changes among individual Project 23 fossils of unknown species affinity (Fossil PGM Change [%]). Note: Datasets are paired according to the respective error sources they quantify. Analyzed levels (i.e., error sources) of each comparison are bolded, and mean differences among datasets for each level are italicized. Dataset name segments indicate the following: Dinolite = images photographed with a Dino-Lite camera are included; Nikon = images photographed with a Nikon camera are included; NoTilt = specimens photographed from a standardized orientation are included; Tilted = specimens photographed from haphazardly tilted orientations are included; EO = landmark configurations digitized by the experienced observer are included; NO = landmark configurations digitized by the new observer are included; T1 = landmark data from the first digitizing iteration of the respective image set and observer are included; and T2 = landmark data from the second digitizing iteration of the respective image set and observer are included.

| Predicted group membership replicability
Predicted group memberships of unknown Microtus fossils from Project 23, Deposit 1, at Rancho La Brea vary substantially among the nine landmark datasets (Tables 1 and 3; Table A1). Individual fossil PGM discrepancies range from 9.7% to 45.2% among the 22 pairwise dataset comparisons (Table 1). As with differences in PGM error of recent specimens, PGM differences among the unknown fossils are greatest between trials of differential presentation (mean PGM variation = 43.6%), followed by different imaging devices (32.7%), different observers (29.5%), and within observers (22.6%; Table 1). Microtus californicus is the most frequently assigned species; the number of fossil individuals classified as M. californicus with predicted probabilities >.95 ranges from three to 20 among datasets (Table 3). Microtus californicus is considered "likely present" in four of the nine datasets according to our likelihood criterion. Microtus longicaudus is the second most frequently assigned species; zero to eight individuals are classified as this species within datasets after probability vetting, and it is considered "likely present" in one of the nine datasets (Table 3). The numbers of individual specimens assigned to the other three species are 0, 0-2, and 0-1 for M. montanus, M. oregoni, and M. townsendii, respectively, after probability vetting. None of those species are considered "likely present" in any dataset using our likelihood criterion (Table 3; Table A1).

| Landmark data acquisition error
We have shown that error introduced from various landmark data acquisition sources can be substantial and, in some cases, explains >30% of nonbiological variation among datasets (Tables 1 and 2). This is concerning for geometric morphometric analyses aiming to quantify shape change among biological groups, including studies of taxonomy, functional ecology, and population history, because large amounts of error may impact hypothesis testing outcomes and/or lead to erroneous interpretations of focal-group relationships (Buser et al., 2018). It is therefore necessary to identify source-specific causes of error and establish protocols to mitigate error as much as possible.
We find interobserver error to be greatest among datasets, overall, followed by error among specimen presentations, within observers, and among different types of imaging equipment (Tables 1 and 2; Figure 2), though the relative importance of device versus intraobserver error differs between the pairwise and nested Procrustes ANOVA analyses (Tables 1 and 2). Such discrepancies are not unexpected since intraobserver error is an inextricable component of device error, presentation error, and interobserver error in pairwise analysis, and therefore, its impact is best captured by the nested analysis. In all cases, variation attributed to data acquisition error sources is less than biological variation among individuals but greater than or approximately equal to variation among species (Tables 1 and 2). These findings generally agree with quantifications of interobserver and device error in other studies (e.g., Fruciano et al., 2017; Robinson & Terhune, 2017). For example, Fruciano et al. (2017) found interobserver error to be greater than device error or biological asymmetry, explaining up to 10.2% and 5.4% of the variation within datasets, respectively.

TABLE 2 Nested Procrustes ANOVA summary statistics of variation attributed to biological factors (i.e., among species and individual specimens) and three data acquisition error sources (imaging device, interobserver digitization, and intraobserver digitization) across eight landmark datasets. Note: The "Dinolite_Tilted_EO_T1" dataset was not included since it did not fit into the nested analytical hierarchy (Figure A1).
Similarly, Robinson and Terhune (2017) found observer error to be the greatest nonbiological source of variation. Interobserver variation in our study may have been exacerbated by differences in digitizing experience among operators, and device error may be elevated by residual presentation error despite controlling for this. Indeed, pairwise Procrustes ANOVA of datasets digitized by the EO and NO yielded considerable differences in R² and repeatability, with less error and higher repeatability of datasets digitized by the EO, suggesting that experience does reduce digitizing error here. Mean intraobserver R² is 9.84% and 13.72%, and repeatability is 0.72 and 0.61, for the EO and NO, respectively (Table 1). Changes in specimen presentation yield the second greatest amount of landmark data variation in pairwise comparisons (Table 1). Although presentation error may have been exacerbated by the intentional tilting of specimens in our treatment, Fruciano (2016) also found presentation-based variation to be significant and substantially greater than intraobserver error on 2D landmark configurations of fish body shape.

Data of unknown specimens may be especially sensitive to measurement error because they are often acquired, appended, and/or analyzed only after a meaningful group-binning protocol has been established among training groups (e.g., Cassini, 2013; De Meulemeester et al., 2012; Figueirido, Martín-Serra, Tseng, & Janis, 2015; this study). Thus, data acquisition processes and their associated errors may be repeated during unknown specimen data collection, which may exacerbate the amount of artificial variation present among unknowns relative to specimens in the training set. For example, replicating the orientation of recent specimen teeth projected from within jaws could be difficult when projecting isolated teeth of fossil specimens.
Indeed, our data show that orientation changes among specimens captured in 2D images can profoundly impact recent and fossil specimen classification statistics (Table 1; Figure 3b,d). Differences in digitizing personnel and/or imaging instruments used to obtain recent and fossil specimen data could accentuate those errors.

| Impacts on group classification statistics
Data acquisition error is not only problematic for evaluating the number of specimens classified within a group, but it can also lead to erroneous inferences of taxonomic occurrences at sites when PGM is performed on specimens of unknown taxonomic affinities (Table A1). Increasing sample sizes of training and unknown specimen groups, rerunning analyses on multiple landmark iterations, and employing error-based occurrence and PGM probability vetting can help elucidate which group occurrences are likely real and which are likely attributable to nonbiological error sources (e.g., Table 3). Even with those measures, however, the presence of some groups may be uncertain depending on analysis-specific intergroup similarity and PGM accuracy. For example, the few individuals assigned to M. montanus, M. oregoni, and M. townsendii must be viewed with skepticism since they fall within the range of cross-validated PGM error of all recent specimen training sets using our species likelihood criterion (Table 3). However, M. longicaudus is considered likely present in one or eight of the nine landmark datasets depending on which likelihood criterion is used (Table 3, Table A1). This is interesting since an isolated, high-elevation population of M. longicaudus is present in the San Bernardino Mountains today (Patterson et al., 2003). Nevertheless, the occurrence of M. longicaudus in Deposit 1 at Rancho La Brea is uncertain until a larger fossil sample size is acquired.

FIGURE 2 Thin-plate spline deformation grids illustrating mean shape changes between reference dataset and target dataset landmark configurations of Microtus lower first molars. (a) Specimen presentation impacts on overall landmark coordinate shape between datasets "Dinolite_NoTilt_EO_T1" and "Dinolite_Tilt_EO_T1." (b) Imaging device impacts on overall landmark coordinate shape between datasets "Nikon_NoTilt_NO_T2" and "Dinolite_NoTilt_NO_T2." (c) Interobserver impacts on overall landmark coordinate shape between datasets "Dinolite_NoTilt_EO_T2" and "Dinolite_NoTilt_NO_T2." (d) Intraobserver impacts on overall landmark coordinate shape between datasets "Dinolite_NoTilt_EO_T1" and "Dinolite_NoTilt_EO_T2." See Table 1 for dataset name explanations.

| Relationships among error proxies and data replicability
Our results indicate that some landmark data acquisition sources contribute relatively large amounts of variation across error proxies (e.g., interobserver error quantified by Procrustes ANOVA and LDA; Tables 1 and 2; Figure 2). However, all error sources are significant and impact classification statistics of recent and fossil Microtus specimens to some extent (Table 1; Figure 3). The fact that source-specific measurement error is significant, alone, does not indicate that it will substantially impact downstream classification results. For example, Fruciano et al. (2020) found significant differences in fish body shape attributed to different preservation treatments. However, the impact of preservation treatment on LDA fish-group classification was minimal. The authors attributed that discrepancy to differences in shape change direction between the fish groups of interest and preservation-based error (i.e., shape change due to preservation and biological shape change was not parallel in that system; Fruciano et al., 2020).
In our study, extant Microtus PGM accuracy and consistency generally align with Procrustes ANOVA variation such that pairwise comparisons of datasets with lower R² values and higher repeatability values exhibit greater PGM agreement (Figure 4b). The latter trend is possibly due to the additional data acquisition phases, and thus greater error potential, inherent in classifying unknowns, as mentioned. One notable exception to the overall trend of recent and fossil specimen pairwise data is observed in presentation-based error.

FIGURE 3 (a) Boxplot of linear discriminant analysis predicted group membership error percentages using leave-one-out cross-validation across all extant Microtus species for each dataset in this study (n = 9), grouped by observer, imaging device, and specimen orientation. See Table 3 for error values generated from individual datasets. (b-d) Plots of linear discriminant functions one (LD1) and two (LD2) from a subset of the nine landmark datasets: (b) EO intraobserver mean of coordinates "Dinolite_NoTilt_EO_T1" and "Dinolite_NoTilt_EO_T2," (c) NO intraobserver mean of coordinates "Dinolite_NoTilt_NO_T1" and "Dinolite_NoTilt_NO_T2," and (d) "Tilted" specimen presentation "Dinolite_Tilt_EO_T1." All 247 recent Microtus individuals are grouped according to species affinity: "Mc" = Microtus californicus, "Ml" = Microtus longicaudus, "Mm" = Microtus montanus, "Mo" = Microtus oregoni, and "Mt" = Microtus townsendii. "Error" = the percentage of cross-validated predicted group membership error across all species.
Major discrepancies occur, however, in the PGM accuracy of recent specimens and the PGM affinity of fossil specimens between tilted and nontilted datasets (Figure 4a). In other words, measurement error attributed to different devices and to intraobserver variation does not have as strong an effect on classification results as measurement error between specimen presentations and among observers. This suggests that the direction of biological shape variation among Microtus species is more dissimilar to the direction of artificial shape variation attributed to device and intraobserver differences than it is to that attributed to presentation and interobserver differences.
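One way to see why directionality matters is to compare shape-change vectors directly. The following is a minimal numpy sketch, not part of the original analyses; the function name and landmark displacements are invented for illustration:

```python
import numpy as np

def shape_change_angle(delta_a, delta_b):
    """Return the angle (in degrees) between two flattened shape-change
    vectors, e.g., a between-species difference in mean landmark
    configurations versus a between-trial (error) difference."""
    a = np.ravel(np.asarray(delta_a, dtype=float))
    b = np.ravel(np.asarray(delta_b, dtype=float))
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

# Hypothetical (x, y) displacements at three landmarks:
biological = [[0.02, 0.00], [0.00, 0.01], [-0.02, -0.01]]
artificial = [[0.00, 0.02], [-0.01, 0.00], [0.01, -0.02]]
angle = shape_change_angle(biological, artificial)
# Angles near 0 (or 180) degrees mean the error vector mimics (or
# inverts) the biological signal; angles near 90 degrees mean the two
# are close to orthogonal and error is less likely to drive
# misclassification.
```

In this framing, device and intraobserver error would correspond to angles far from 0 relative to the interspecific axes of variation, while presentation and interobserver error would lie closer to them.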
Reduced pairwise PGM error discrepancies among devices and within observers may thus be caused by differences in the directionality of biological versus artificial shape change. Elevated pairwise PGM error among different specimen presentations relative to other error sources could be explained, in part, by the different proxies used to quantify error: pairwise Procrustes ANOVA comparisons quantify landmark precision, whereas LDA PGMs quantify landmark accuracy.
However, the inconsistency of presentation error quantified among those proxies is far greater than that of any other error source (Table 1; Figure 4), so it is unlikely that the discrepancy is entirely explained by the different error proxies. Another possible explanation for the presentation error discrepancy observed in our study is changes in specimen landmark configurations caused by image distortion.
Rotational changes among 3D specimens in "tilted" trials may distort certain tooth loci captured in 2D images. Such distortions would then displace landmarks subsequently placed on those loci. Although orientation differences among landmark configurations are mitigated during GPA, the generalized least-squares algorithm that aligns the configurations to a common coordinate system does not adjust error based on individual point variation. Rather, corrections are distributed across the entire configuration, reducing the residual variation of less precise landmarks and increasing that of more precise landmarks so as to minimize error overall (von Cramon-Taubadel, Frazier, & Lahr, 2007). This "spreading" of landmark coordinate error during GPA, termed the "Pinocchio effect" (Chapman, 1990; von Cramon-Taubadel et al., 2007), may alter total specimen shape and thus the biological variation among specimens captured via landmarks. Such spreading is particularly problematic for LDA, since variables that maximize among-group separation are preferentially selected (Kovarovic et al., 2011). Distributing the error of highly variable landmarks (e.g., those affected by presentation inconsistencies) across all landmarks may reduce the overall error quantified by analyses of variance, but it may also inhibit the discriminatory power of classification analyses because the variables that best separate groups may be altered in the process. Indeed, stepwise LDA indicates that the variables selected for discriminating Microtus species differ between tilted and nontilted trials (Table A2). It is perhaps relevant that the differences between variables entered in stepwise LDA of tilted versus nontilted trials often occur among landmarks positioned on tooth extremities (e.g., Landmarks 3, 13, and 21; Figure 1; Table A2), which are closest to the image edges, where distortion is generally greatest (Zelditch et al., 2004).
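The spreading of single-landmark error under least-squares superimposition can be demonstrated with a minimal ordinary Procrustes sketch (a two-configuration simplification of the full GPA used in this study; the square landmark coordinates below are hypothetical):

```python
import numpy as np

def procrustes_align(ref, target):
    """Least-squares superimposition of `target` onto `ref`:
    translate both to a common centroid, scale to unit centroid size,
    then rotate `target` via SVD to minimize summed squared distances."""
    ref_c = ref - ref.mean(axis=0)
    tgt_c = target - target.mean(axis=0)
    ref_c = ref_c / np.linalg.norm(ref_c)
    tgt_c = tgt_c / np.linalg.norm(tgt_c)
    u, _, vt = np.linalg.svd(tgt_c.T @ ref_c)
    return ref_c, tgt_c @ (u @ vt)

# A square of four landmarks; the replicate displaces ONLY landmark 0,
# as a tilted-presentation image distortion might.
reference = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
replicate = reference.copy()
replicate[0] += [0.3, 0.0]

ref_a, rep_a = procrustes_align(reference, replicate)
residuals = np.linalg.norm(ref_a - rep_a, axis=1)
# Landmark 0 retains the largest residual, but the superimposition has
# "spread" part of its error onto the other three landmarks as well.
```

Although the error was introduced at a single locus, every landmark shows a nonzero residual after alignment, which is exactly why locus-specific presentation distortion can be hard to isolate with configuration-wide variance measures.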
Overall, these results indicate that GM classification results of morphologically similar taxa are not always replicable due, in part, to multiple sources of data acquisition error. No two iterations among the nine resampled specimen datasets of this study exhibit the same intragroup classification results, and many datasets yield dissimilar predictions of fossil species occurrences (Table 3, Table A1).
However, our findings may differ from those of other studies, since the impact of measurement error on data replicability will likely vary based on analysis-specific objectives, inter- and intragroup similarity, and statistical classification accuracy (Robinson & Terhune, 2017). For example, small to moderate amounts of measurement error may be negligible for studies classifying organisms at the family level because among-group biological variation may surpass any artificial variation introduced to that system. Similarly, small amounts of measurement error and classification inaccuracy may be acceptable for quantifying interspecific occurrences, but not for quantifying intraspecific abundances within the same system. The amount of introduced error that surpasses an acceptable threshold will likely vary case by case depending on the respective analytical design, focal system, and questions/objectives of the study (Fruciano, 2016; Robinson & Terhune, 2017). Nevertheless, there are general measures that can be taken to mitigate error in any system.

Mitigating error and error impacts
It is impossible to eliminate GM error completely (Fruciano, 2016), but there are several ways to lessen the amount of error introduced.
Presentation error, for example, has the most egregious impact on group classification replicability in our study. Although this error may have been exacerbated by the intentional tilting of specimens in the "tilted" data acquisition trial, our findings indicate that presentation error can considerably impact landmark-based classification statistics if not properly managed (Table 1; Figures 3 and 4). Of further concern is the fact that this error source is less detectable through common error-quantifying methods (i.e., Procrustes ANOVA) than other data acquisition sources that introduce large amounts of error (e.g., interobserver differences) (Table 1; Figure 2), possibly due to the Pinocchio effect of GPA. Presentation error can be mitigated by using 3D GM, which bypasses error associated with dimensional loss (Buser et al., 2018; Cardini, 2014). Three-dimensional GM technology has improved greatly over the past two decades with respect to data resolution and cost (Cardini, 2014). High-resolution 3D analyses that were previously restricted to larger specimens are becoming increasingly applicable to small objects (e.g., Cornette, Baylac, Souter, & Herrel, 2013), such as the Microtus molars evaluated in our study. However, 2D GM will remain more feasible for some projects since it is generally more affordable and can be conducted faster and with more versatile analytical equipment than 3D GM (Cardini, 2014). Researchers conducting 2D GM analyses should therefore standardize specimen projection orientations as much as possible to mitigate presentation error.
For both 2D and 3D GM analyses, we recommend that researchers avoid mixing observers, given the considerable amount of digitization error that can be generated among them (Tables 1 and 2; Figures 2 and 4). After such precautions have been taken, determining the fidelity of statistical results, and whether the amount of error introduced is negligible, will be study-specific and dependent on intergroup data similarity and the overall accuracy of the analysis. Our findings suggest that, in general, groups with low numbers of unknown individuals assigned to them should be interpreted with caution, especially when classification accuracy and/or among-group variation is relatively low (Tables 2 and 3; Figure 3; Table A1). Using relatively large sample sizes, applying posterior probability thresholds, and performing multiple (intraobserver) digitizing iterations may help assess the fidelity of inferred group occurrences.
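The two statistical safeguards named above, leave-one-out cross-validation and posterior probability vetting, can be sketched as follows. This is an illustrative scikit-learn example on synthetic stand-in data, not the study's landmark variables; the group sizes, variable count, and 0.90 threshold are arbitrary choices for demonstration:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(0)
# Synthetic stand-in for Procrustes-aligned shape variables: two
# partially overlapping "species", 40 specimens each, 6 variables.
X = np.vstack([rng.normal(0.0, 1.0, (40, 6)),
               rng.normal(1.0, 1.0, (40, 6))])
y = np.repeat([0, 1], 40)

predictions, vetted = [], []
for train_idx, test_idx in LeaveOneOut().split(X):
    lda = LinearDiscriminantAnalysis().fit(X[train_idx], y[train_idx])
    posterior = lda.predict_proba(X[test_idx])[0]
    predictions.append(int(posterior.argmax()))
    # Posterior probability vetting: flag assignments below a threshold
    # as too uncertain to count toward group occurrences.
    vetted.append(bool(posterior.max() >= 0.90))

error_rate = float(np.mean(np.array(predictions) != y))
n_confident = sum(vetted)
```

Running the same loop over repeated digitizing iterations of the same specimens, and comparing `predictions` across iterations, gives a direct check on how stable the inferred group occurrences are.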
In conclusion, GM measurement error from different landmark data acquisition sources has the potential to obscure biologically meaningful shape variation, facilitate statistical misclassification, and negatively impact data replicability. We do not discourage using GM for biological group classification since it is among the most powerful techniques available for quantifying shape and shape variability among groups. Rather, we hope this study provides an informative, if cautionary, example of why GM error should be mitigated to the greatest feasible extent. After precautions have been taken to reduce measurement error, repeated measurements and statistical evaluations can be employed to facilitate decisions of whether the amount of residual error is acceptable for study-specific research objectives.

ACKNOWLEDGMENTS
We thank Jenny McGuire (Georgia Institute of Technology) for contributing the Nikon-photographed specimen images included in our analyses. We are also grateful for the specimen access and assistance [...]

[Caption fragment: as in Table 3, but without leave-one-out cross-validation or posterior probability vetting.]