When is enough, enough? Questions of sampling in vertebrate ichnology

Sample size is a challenge for most field scientists: it is determined not by the statistical ideal, but by what is available. In vertebrate ichnology, track length is an important variable that correlates well with the track‐maker's biology. It is also key to estimating the minimum number of individuals (MNI) present on a trampled horizon. Broad assumptions about the biometrics of the track‐makers are often made on the basis of a few prints, without consideration of intra‐trackway variability. In this study we use a simple bootstrapping algorithm to explore how variance changes with sample size for a range of trackways, including fossil and experimental examples, in order to determine the minimum sample size required to extract linear measurements. Predictably, experimentation shows that inter‐step variability changes with track‐maker and substrate, but the degree of variance is not as marked as previously anticipated. Change‐point modelling suggests that a maximum sample size of 22–25 captures most of the variance present, in track length at least; another threshold at 7–10 has been identified, which represents a reasonable minimum sample size. Samples of fewer than seven tracks are subject to large amounts of potential variance and are unlikely to provide reliable and consistent measurements. These sampling thresholds hold across a wide range of depositional environments and track‐makers. We calculate generic standard errors for human track‐makers, which may help the practitioner with small samples to estimate the likely errors, especially when making MNI estimates. The challenge for the wider vertebrate ichnology community is to explore this issue for other track‐makers and to develop similar guidance.

Geologists, palaeontologists, archaeologists and bioanthropologists are pragmatic folk and have to work with what they discover, even if it is never enough! This is especially true in vertebrate ichnology. Vertebrate tracks of all types and ages occur widely in the geological record from the Middle Devonian (Stössel 1995; Niedźwiedzki et al. 2010; Stössel et al. 2016) to the near-present (e.g. Avanzini et al. 2011). Perhaps the biggest growth area in terms of discoveries has been human tracks, which were once considered a freak act of geological preservation (e.g. Leakey & Hay 1979), but a spate of recent discoveries has shown that this is far from true (e.g. Morse et al. 2013; Helm et al. 2018; Duveau et al. 2019; Bennett et al. 2020; Hatala et al. 2020). Human tracks allow us to make inferences about: occurrence (e.g. Morse et al. 2013; Bennett & Morse 2014; Altamura et al. 2018; Helm et al. 2018, 2020), biomechanics (e.g. Hatala et al. 2013, 2016; McClymont et al. 2016; Raichlen & Gordon 2017), stature and body mass (e.g. Dingwall et al. 2013; Domjanic et al. 2015) and, ultimately, behaviour and group demographics (e.g. Roach et al. 2016; Hatala et al. 2017). This also applies more widely to any other vertebrate tracks (e.g. Thulborn 1990), and relies on three things: the relationship of track depth to plantar pressure (e.g. Bates et al. 2013); the accuracy with which a track outline characterizes the foot of the track-maker in terms of size and shape (e.g. Gatesy & Falkingham 2017; Marchetti et al. 2019, 2020; Falkingham & Gatesy 2020); and finding evidence for the contemporaneity of interacting tracks. While the latter can be achieved with one or two crosscut tracks, the first two rely on the size of the sample of tracks, the variability within that sample and how well it represents the foot, biomechanics or behaviour of the track-maker.
The degree to which an individual track represents the shape of the foot and its biomechanical function is determined by the inter-step variability in footfall, variability in substrate in the direction of travel, and track taphonomy. Any sample also contains a component of inter-step measurement precision. The biomechanics of each step has many moving parts, as demonstrated by human biomechanics (e.g. Elftman & Manter 1935; Ker et al. 1987; Harcourt-Smith & Aiello 2004; Caravaggi et al. 2009), and variation in any one may be manifest in changes in plantar pressure and therefore in the distribution of track depth and maximal shape. No footprint is identical to another; each varies within a morphological envelope (e.g. Morse et al. 2013). The problem for ichnologists is that they rarely sample the full range of this morphological envelope, being limited by the number of tracks that their excavation reveals or preserves, or in some cases by the extent to which they are permitted to excavate (Fig. 1). This puts an intrinsic limit on the reliability of inferences such as the biometrics or biomechanics of the track-maker. Even if the structure of the track-maker is known or easily approximable, the older literature on human ichnology is littered with dubious assertions about the characteristics of track-makers made on the basis of single tracks (e.g. Roberts et al. 1996; Roberts & Berger 1997). This is particularly true for stature, body mass and estimates of the minimum number of track-makers (MNT or MNI). Webb et al. (2014) used an interesting approach in which they determine MNI from an equation (Equation 1) in which Length Range is the total range in footprint lengths, σ is the standard deviation and CI is the confidence limit being used, typically 95%, which corresponds to a value of 1.96. They determined σ from modern analogue studies.
MNI estimates are also made by reference to a length of ±5% of the mean; if two tracks differ by more than this 'magic' 5% then they are deemed to belong to two individuals. In practice this should be the 95% standard error (SE) of the mean, but an ichnologist is rarely able to sample the true variability of a population due to issues of preservation or exposure. Recourse to 'typical' SE, using both modern analogue data and data from long fossil trackways, may help to mitigate these errors and provide better guidance to the ichnologist. We aim here to explore this variability and provide such guidance. To the trained statistician this may all seem obvious, embedded in the properties of the normal distribution, but we believe it is a timely and important reminder for field scientists who usually have to work with what they have.

FIG. 1. Trackway sampling curves. A, bootstrapped sample of track length for the White Sands National Monument (WHSA) double trackway (Bennett et al. 2020); as the sample size increases the variance falls. B, variation in standard error (SE) with sample size for the WHSA double trackway; note the wide variance within the 95% confidence area. C, variation in SE for a Namibian long trackway reported by Morse et al. (2013); the variance is much less within this trackway and demonstrates that the variance is potentially specific to each trackway.

Figure 1 shows typical output for a couple of long human trackways. As one would expect, SE declines with increasing sample size. The challenge with such smooth curves is to determine the point(s) at which further increases in sample size give only marginal improvements in SE. Two approaches were used here. The first involved computing linear regressions and associated R² values for the whole sample and then progressively for N−1 until a minimum of N = 5 was reached. R² values improve with increasing sample size as the data tail becomes more linear and flatter. This provides an alternative way of visualizing the change in variance with sample size but does not identify any particular breaks in slope. The other approach involves change-point modelling. This was developed by Gallagher et al. (2011) to detect breaks in multivariate geochemical data within a borehole or core sample and is implemented here in PAST version 4.03 (Hammer et al. 2001). The algorithm is a Bayesian, 'transdimensional' Markov chain Monte Carlo (MCMC) method and produces not a single output but a large number of simulations derived from the distribution. Used on a single curve it produces a predictable result in which there is a gradual decline in the frequency of change-points identified (PAST function: see Model/change-point). However, when multiple data curves are used, equivalent to different geochemical proxies in its intended use, with each curve given equal weight, it is able to identify change-points which occur slightly more frequently than others. It therefore provides a way of synthesizing common breaks in multiple curves.
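The SE-against-sample-size curves described above can be illustrated with a minimal sketch. The resampling scheme is an assumption, since the text does not specify it: for each sample size N the sketch draws N track lengths with replacement, computes SE as the sample standard deviation divided by the square root of N, and repeats this to obtain a mean SE and an approximate 95% band. The function name `bootstrap_se_curve`, its parameters and the track lengths are hypothetical and are not the study's code or data.

```python
import random
import statistics


def bootstrap_se_curve(lengths, n_min=2, n_boot=200, seed=0):
    """Return {n: (mean_se, lower, upper)} for n = n_min .. len(lengths)."""
    rng = random.Random(seed)
    curve = {}
    for n in range(n_min, len(lengths) + 1):
        se_values = []
        for _ in range(n_boot):
            # Assumption: resample n track lengths with replacement.
            sample = rng.choices(lengths, k=n)
            se_values.append(statistics.stdev(sample) / n ** 0.5)
        se_values.sort()
        # Approximate 95% band taken from the sorted bootstrap replicates.
        lower = se_values[int(0.025 * (n_boot - 1))]
        upper = se_values[int(0.975 * (n_boot - 1))]
        curve[n] = (statistics.fmean(se_values), lower, upper)
    return curve


# Hypothetical track lengths in mm (not data from the study).
data_rng = random.Random(1)
track_lengths = [data_rng.gauss(250.0, 12.0) for _ in range(40)]

curve = bootstrap_se_curve(track_lengths)
for n in (5, 10, 25, 40):
    mean_se, lo, hi = curve[n]
    print(f"N={n:2d}  SE={mean_se:5.2f} mm  95% band=({lo:5.2f}, {hi:5.2f})")
```

Plotting the mean SE and its band against N should give a curve of the kind described for Figure 1B and C, falling steeply at small N and flattening as N grows.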

METHOD
Three types of input data were used in this analysis. Firstly, the last author (MRB) walked barefoot, at a similar constant speed, in four different substrate conditions exposed at low tide on the Conwy Estuary in North Wales, UK. A total of 50 footprints were photographed for each environment, then rectified and scaled in Photoshop, where a simple maximum length was also measured. The second type of data was obtained by searching the human ichnological record for long trackways. Based on the Conwy dataset, a minimum trackway length of ten tracks was selected, although most are appreciably longer (Belvedere et al. 2021, tables 1, 2). This dataset includes tracks from a range of different sites: White Sands National Park (USA), Sefton Coast (UK), Walvis Bay (Namibia) and Engare Sero (Tanzania), and also small samples of early hominins from Ileret (Homo erectus, Kenya) and Laetoli (Australopithecus afarensis, Tanzania) (Belvedere et al. 2021, table 1). For the human/hominin trackways, data were either: (1) published length measurements; (2) bootstrapped from means and standard deviations reported in the literature; or (3) measured directly from data curated by the authors. Finally, trackway data for both tridactyl (theropod) and sauropod tracks from Switzerland, South Korea, Portugal and China (Belvedere et al. 2021, tables 3, 4) were used. The same threshold of a minimum trackway length of ten tracks was applied for this dataset.

RESULTS
Figure 2 shows the output from the neoichnological estuary experiment along with typical tracks made in each of the four sampled environments. The data has been computed for both the median and mean to highlight the differences. The tracks made in compact sand and silty sand (Fig. 2, numbers 1 and 2) show least variance, as one might expect, while the highest variance occurs in a shallow mud, in which the foot tends to skate on the sublayer (Fig. 2, number 4). This latter environment gave a few extreme values; note the steep decline in the curve with increasing values of N. The soft, wet mud (Fig. 2, number 3) also has a high variance, sustained over a longer range of values of N. Here variance is due to difficulty in determining the location of maximum digit position, although the deeper mud tends to hold the foot more firmly, preventing slippage. Given that (1) the track-maker is the same individual for all trackways, (2) speed was constant and (3) the method of measurement was the same, the variance identified primarily reflects substrate. As Bates et al. (2013) concluded, shallow tracks (Fig. 2, number 1) often contain the 'best data'. The regression analysis suggests that the decline in sample variance levels off at N = 25, beyond which further sampling gives limited return. The change-point modelling leads to a similar conclusion in terms of the maximum sample size but also identifies peaks at N = 10 and N = 4. Both these thresholds show improvements in sample stability. One might tentatively conclude that across the four estuarine substrates, a track sample of N < 4 is going to be poor, 4 < N < 10 better, with 10 < N < 25 likely to be reasonable and samples where N > 25 ideal. If we consider the median rather than the mean, a similar four-fold division can be identified, although the maximum sample size falls closer to N = 20, indicating that a slightly smaller sample is needed for estimates using median values. In addition, the shallower the tracks at the time of imprinting, the smaller the sample that is probably needed, assuming complete preservation is achieved.
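The progressive regression used to visualize where the SE curve flattens can be sketched as follows. The exact procedure is not fully specified in the text, so this sketch assumes that a straight line is fitted to SE against N for the whole curve, that the smallest-N point is then dropped one at a time, and that this stops when five points remain. The helper names `r_squared` and `tail_r_squared` and the idealized curve (SE = σ/√N with a hypothetical σ of 12 mm) are illustrative only.

```python
def r_squared(xs, ys):
    """Coefficient of determination for a simple linear fit of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy ** 2 / (sxx * syy)


def tail_r_squared(sizes, se_values, min_points=5):
    """R^2 of a linear fit to the curve tail starting at each sample size.

    Assumption: points are dropped from the small-N end, one at a time,
    until only `min_points` remain.
    """
    results = {}
    for start in range(len(sizes) - min_points + 1):
        results[sizes[start]] = r_squared(sizes[start:], se_values[start:])
    return results


# Idealized SE curve with a hypothetical sigma of 12 mm (not study data).
sizes = list(range(2, 51))
se_values = [12.0 / n ** 0.5 for n in sizes]

r2 = tail_r_squared(sizes, se_values)
for start in (2, 10, 25, 40):
    print(f"tail from N={start:2d}: R^2 = {r2[start]:.3f}")
```

Under these assumptions R² rises towards 1 as the fitted tail starts at larger N, which matches the statement that the tail becomes more linear and flatter, but the sketch does not by itself identify breaks in slope.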
If we look at published data (Fig. 3) for both experimental (mainly sandy beaches, or sand trays) and fossil cases, the variance is, as one might expect, much lower for the experimental trackways, where taphonomic processes are excluded (e.g. Wiseman & De Groote 2018), the substrate is more homogeneous and the track-maker is known and walking at a constant pace. The fossil data reveals some interesting contrasts. Two trackways, possibly the longest in the world (Bennett et al. 2020), from White Sands National Park show significantly more variance than the other trackways. The track-maker moved over a flat surface with a uniform substrate at a steady speed. Conditions were similar to those of substrate four in the experimental Conwy data (Fig. 2), where the track-maker skated in softer mud above a less compressible sub-layer. Broadly speaking there is a continuum in variance between the mud-rich substrates and the sandier substrates, not dissimilar to that observed in the modern analogue studies. This data more clearly establishes the three zones of change (Fig. 3) picked out by the change-point modelling in the Conwy data (Fig. 2), although the maximum sample size is closer to N = 20 rather than 25 and the intermediate point closer to N = 7.

FIG. 2. Standard errors (SE; mm) plotted against sample size for four track samples made in four different substrates found at low tide on the Conwy Estuary in July 2020 (SH 79470 77361). A, the four SE curves associated with the median values with increasing sample size, superimposed on the histogram of identified change points determined from these four curves plus their 95% confidence intervals (N = 12). B, SE curves associated with the mean values. There are three points, corresponding to the three peaks in each histogram, at which change-points are more commonly identified by the analysis. C, R² values for multiple linear regressions for a succession of samples each N−1; R² values fall between 0 and 1, with the latter being a near perfect data fit. Footprint scale bar represents 150 mm.
To establish whether the locomotory behaviour of the track-maker was a factor in this analysis, data was assembled for a number of tridactyl (theropod) and sauropod dinosaur tracks from a number of locations (Belvedere et al. 2021, tables 3, 4; Fig. 4). A total of 68 trackways are included in this analysis and show similar patterns to those found for human tracks. The level of variance is not dissimilar to that for human tracks, especially when one considers the difference in scale between some of these tracks.

DISCUSSION
Field scientists know inherently that more is usually better in terms of a sample (e.g. Kintigh 1989; Meltzer et al. 1992) and their challenge typically revolves around the availability of that sample. Our results confirm this obvious point, but also place a potential constraint on sample size for statistically significant ichnological studies. The data reported here appear to suggest, across a range of track-makers, substrates and measurement systems, that a sample size in excess of 22-25 for a single individual yields little gain in terms of minimizing variance within the sample. This threshold is reduced only slightly if the median is used (c. N = 20). Variance increases continuously with decreasing sample size, but our analysis suggests that it does so more markedly below a sample of seven tracks. Given that sample sizes are often small (e.g. Roberts & Berger 1997; Ashton et al. 2014) this is encouraging. The data also clearly show the risks of making track-length-based inferences (e.g. track-maker size, body mass) from samples of fewer than seven tracks. These thresholds (Fig. 5) are to some extent artificial, since the SE curves are continuous, but they provide a broad guide. Most field geologists have round numbers in mind when seeking samples and this information will only reinforce these natural prejudices, but every additional specimen improves the quality of the sample up to, but not beyond, 22-25.
FIG. 3. […], table 2). In each case, SE curves are shown above, with increasing sample size, superimposed on the histogram of identified change points determined from these curves. CI curves were excluded from this analysis. There are three points at which change-points are more commonly identified by the analysis. R² values for multiple linear regressions for a succession of samples each N−1 are shown below. R² values fall between 0 and 1, with the latter being a near perfect data fit.

FIG. 4. Data for long dinosaur trackways from various sites as set out in Belvedere et al. (2021, tables 3, 4). A, data from 16 trackways at five locations made by a bipedal tridactyl (theropod) dinosaur. B, dinosaur data from a total of 52 trackways consisting of paired manus and pes tracks (i.e. each trackway is represented by two curves) from the Canton Jura (Switzerland). C, regression plots between SE values in manus and pes tracks from the same trackway (i.e. same individual); axes are logged to improve the clarity of the plot; note that in most but not all cases variability in manus tracks is less than for pes tracks: equal variance = 27%, SE manus > SE pes = 23%, SE pes > SE manus = 50%.

The other thing that we can do with this type of data is generate generic curves for specific track-makers and/or environments by averaging the individual curves (Fig. 5). These are generated by using the average standard deviations shown in Belvedere et al. (2021, table 1) to bootstrap between 50 and 100 length values; these are then put through the SE modelling algorithm to produce average SE curves with 95% error margins. By repeating this process a hundred times and averaging the results we obtain a stable set of […] with the potential SE for the sample size. Take for example the artificial sample in Figure 6, which shows a collection of tracks tentatively grouped into four categories. Table 1 shows the basic foot length data for this scenario and, based on the pair-wise comparison of the mean size differences in light of the estimated SE values in Belvedere et al. (2021, table 5) for a given sample size, we can conclude tentatively that the minimum number of individuals is three. Using the method of Webb et al. (2014), set out in Equation 1, the estimate is two individuals (specifically 2.34). The generic SE values provide a means of estimating potential SE and making conservative estimates on this basis. It is important to note that we have simply focused on the use of foot length for MNI estimates and it may be possible to develop superior measures using multi-dimensional properties.
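The pair-wise comparison behind the worked example can be sketched as below. All numbers are hypothetical: the group means, track counts and upper SE bounds are placeholders and are not the values in Table 1 or in Belvedere et al. (2021, table 5). The sketch also assumes that a pair of groups is compared against the larger of their two upper SE bounds, and that groups whose means differ by no more than that bound are merged into a single track-maker.

```python
from itertools import combinations

# Hypothetical upper 95% bounds on the estimated SE (mm), keyed by sample size.
# Placeholders only, not the published look-up values.
SE_UPPER_BY_N = {2: 14.0, 3: 11.0, 4: 9.0, 5: 8.0, 6: 7.0}

# Hypothetical track groups: label -> (mean length in mm, number of tracks).
groups = {"A": (212.0, 3), "B": (240.0, 5), "C": (246.0, 4), "D": (271.0, 2)}


def estimate_mni(groups, se_upper_by_n):
    """Count groups that remain distinct after merging indistinguishable pairs."""
    parent = {label: label for label in groups}

    def find(label):
        # Follow parent links to the representative of the merged set.
        while parent[label] != label:
            label = parent[label]
        return label

    for a, b in combinations(groups, 2):
        (mean_a, n_a), (mean_b, n_b) = groups[a], groups[b]
        # Assumption: use the more conservative (larger) of the two bounds.
        threshold = max(se_upper_by_n[n_a], se_upper_by_n[n_b])
        if abs(mean_a - mean_b) <= threshold:
            parent[find(a)] = find(b)
    return len({find(label) for label in groups})


mni = estimate_mni(groups, SE_UPPER_BY_N)
print(mni)  # prints 3: only groups B and C fall within the bound
```

With these placeholder values only groups B and C are indistinguishable, so four groups reduce to three track-makers, mirroring the logic of the worked example rather than its actual figures.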
We have chosen to provide an SE_est only for human tracks here, since the dinosaur data used is from a limited number of sites and environments. Long dinosaur trackways are relatively common in the literature compared to human ones, but the raw data are not always reported and they are often ichnotaxon-specific, making such data tables harder to compile. Collecting and presenting multiple dimensional measurements from long trackways is something we would encourage the community to focus on in future so that tables of SE_est values can be compiled. An illustration of this point with respect to human tracks is the dataset in Hatala et al. (2020), which discussed several long trackways but, despite documenting the total number of tracks, actually measured only a few. We would encourage the community to sample and report all track measurements to improve our understanding of natural intra-trackway variability.
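A table of estimated SE values of the kind discussed above could, in outline, be generated as in the following sketch. It follows the description of the generic curves (synthetic trackways of 50 to 100 lengths built from an average standard deviation, repeated a hundred times and averaged) but makes several assumptions: lengths are drawn from a normal distribution, the mean length and standard deviation are hypothetical, the 95% error margins are omitted for brevity, and the function names are illustrative.

```python
import random


def standard_error(sample):
    """Standard error of the mean, using the sample standard deviation."""
    n = len(sample)
    mean = sum(sample) / n
    variance = sum((x - mean) ** 2 for x in sample) / (n - 1)
    return (variance / n) ** 0.5


def generic_se_curve(mean_length, avg_sd, max_n=15, repeats=100, n_boot=30, seed=0):
    """Average SE for sample sizes 2..max_n over many synthetic trackways."""
    rng = random.Random(seed)
    totals = {n: 0.0 for n in range(2, max_n + 1)}
    for _ in range(repeats):
        # Assumption: a synthetic trackway of 50-100 normally distributed lengths.
        n_tracks = rng.randint(50, 100)
        trackway = [rng.gauss(mean_length, avg_sd) for _ in range(n_tracks)]
        for n in totals:
            ses = [standard_error(rng.choices(trackway, k=n)) for _ in range(n_boot)]
            totals[n] += sum(ses) / n_boot
    return {n: total / repeats for n, total in totals.items()}


# Hypothetical mean length and average standard deviation in mm.
generic = generic_se_curve(mean_length=250.0, avg_sd=10.0)
for n in (3, 7, 10, 15):
    print(f"N={n:2d}  estimated SE = {generic[n]:.2f} mm")
```

Running the same procedure with standard deviations for other track-makers or substrates would give the corresponding rows of such a table, provided the raw measurements are reported.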

CONCLUSION
Palaeontological and archaeological ichnological records can be fickle and rarely produce the ideal (i.e. large) statistical sample that one might hope for. Variability in tracks of the same individual adds to uncertainty when estimating the minimum number of individuals present on a trampled horizon or when making biometric inferences. The associated SE falls with increasing sample size in a predictable way, and is remarkably consistent across different track-makers, environments and measurement protocols. Change-point modelling suggests that along this continuum one can identify three thresholds, which, although perhaps slightly artificial, can provide some guidance to the field scientist. The improvements in SE beyond a sample of 25 appear to be relatively minor. This threshold represents a reasonable maximum sample size if such a number is required for sampling permissions or conservation statements. The results presented here also suggest that a sample greater than 7-10 gives better results than a smaller one, and that samples of fewer than seven are to be avoided if at all possible. In practice, however, small samples are often the norm where exposure and environment limit preservation. To assist with this, we provide a simple 'look-up' table for human tracks which gives estimated SE for small samples and may help the ichnologist frustrated by a small sample to make appropriately conservative estimates. We suggest that a similar look-up table could be developed for other vertebrates. One final word of caution: this outcome applies only to footprint length. More complex measures, involving multiple dimensions or the whole plantar surface of a track, for example, will involve more degrees of freedom, and with them the minimum sample size is likely to increase substantially.

FIG. 5. A, generic curves and estimated standard errors with CI of 95% produced by averaging the data in Belvedere et al. (2021, table 5). B, schematic illustration of recommended sample sizes; the arrow indicates increasing variance induced by different factors: substrate, track-maker, taphonomy and measurements; curves refer to possible trackways.
Acknowledgements. We wish to thank Dr Zarah Goshi (AstraZeneca PLC) for suggesting the statistical approach in the context of another project. We wish to thank the editor and the two reviewers (J. Lallensack and M. Buchwitz) for the comments and suggestions that helped to improve this paper.
Editor. Lorenzo Marchetti

FIG. 6. Artificial scenario with a series of tracks. See Table 1 and the text for explanation.

TABLE 1. Worked example of an MNI estimate using the scenario shown in Figure 6. […] table 5) based on all categories (fossil/experimental soft/firm substrates) and in all but one case (comparison of tracks B and C) the difference between the means exceeds the maximum estimated SE_est, approximated by the upper value of the 95% CI (CIU). The tentative conclusion is that there is evidence for only three track-makers.