Calling all archaeologists: guidelines for terminology, methodology, data handling, and reporting when undertaking and reviewing stable isotope applications in archaeology

Abstract Stable isotope analysis has been utilized in archaeology since the 1970s, yet standardized protocols for terminology, sampling, pretreatment evaluation, calibration, quality assurance and control, data presentation, and graphical or statistical treatment still remain lacking in archaeological applications. Here, we present recommendations and requirements for each of these in the archaeological context of: bulk stable carbon and nitrogen isotope analysis of organics; bulk stable carbon and oxygen isotope analysis of carbonates; single compound stable carbon and nitrogen isotope analysis on amino acids in collagen and keratin; and single compound stable carbon and hydrogen isotope analysis on fatty acids. The protocols are based on recommendations from the Commission on Isotopic Abundances and Atomic Weights of the International Union of Pure and Applied Chemistry (IUPAC) as well as an expanding geochemical and archaeological science experimental literature. We hope that this will provide a useful future reference for authors and reviewers engaging with the growing number of stable isotope applications and datasets in archaeology.

relating to the bulk stable carbon and nitrogen isotope analysis of organics; bulk stable carbon and oxygen isotope analysis of carbonates; single compound stable carbon and nitrogen isotope analysis on amino acids (AA) isolated from collagen and hair keratin, and stable carbon and hydrogen isotope analysis of fatty acids from artefacts, bone, and sediments based on IUPAC 10 recommendations and an expanding geochemical and archaeological science literature.
While some reviews have touched on a few of these themes in the context of geochemistry as a whole, 11,12 and more recently in forensics, 13 we have written this article to directly increase information flow to archaeological science practitioners, students, and reviewers less familiar with these techniques. We hope this will ensure that archaeologists continue to make substantial contributions to cross-disciplinary advancements in mass spectrometry methods and applications.

| A NOTE ON TERMINOLOGY
Tyler Coplen 11 has previously provided a thorough discussion of stable isotope terminology in Rapid Communications in Mass Spectrometry.
However, many archaeological isotope publications use incorrect terminology, some of which has also been highlighted by Zachary Sharp for geochemistry in general. 14 Firstly, the isotope ratio for samples and standards obtained from Isotope Ratio Mass Spectrometers is in the form of 13 C/ 12 C, 15
The delta notation ($\delta^{i}E_{\text{sample}}$), defined through Equation 1, is commonly employed in reporting stable isotope results. For a certain chemical element (E), the ratios of the abundances of the heavier (i) versus lighter (j) isotope are measured in a sample ($(^{i}E/^{j}E)_{\text{sample}}$) and in an SRM with an internationally accepted δ value ($(^{i}E/^{j}E)_{\text{reference}}$). 15
$$\delta^{i}E_{\text{sample}} = \frac{\left(^{i}E/^{j}E\right)_{\text{sample}} - \left(^{i}E/^{j}E\right)_{\text{reference}}}{\left(^{i}E/^{j}E\right)_{\text{reference}}} \qquad (1)$$
Equation 1 expresses the relative difference in absolute isotopic abundance ratios of a sample against the chosen SRM. Per mil notation (‰) is used as a convenient means of reporting small numerical values. 16 This is not an SI unit of measurement; it is simply a unit of comparison of the sample with a standard with an internationally recognized isotopic abundance. 11,15 It is correct to state that an archaeological material such as bone collagen has a "stable carbon isotope composition". However, it is not possible for bone collagen to have a δ 13
There are often also terminological issues when comparing different samples or measurements during the reporting of results or interpretation. The term 'isotopically depleted' is inappropriate when referring to δ values of archaeological materials. This is not only due to the vagueness of this term (as is also the case with discussing an 'isotopic composition' or 'isotopic value' above), but also because a sample of collagen, plant, or carbonate is not depleted or enriched in isotopes generally. Instead, the terms depleted or enriched should be reserved for changes in the proportion of the heavy isotope (e.g. 13 C) of the element in a given substance or fractionation process.
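As a minimal sketch of the relative-difference definition of δ described above, the following Python function converts a measured isotope ratio into a δ value expressed in per mil. The function name and the two ratios are hypothetical and chosen only to illustrate the arithmetic; they are not reference values for any SRM.

```python
def delta_per_mil(ratio_sample: float, ratio_reference: float) -> float:
    """Relative difference of a sample isotope ratio from a reference ratio, in per mil."""
    if ratio_reference <= 0:
        raise ValueError("reference ratio must be positive")
    # (R_sample - R_reference) / R_reference, scaled by 1000 for per mil reporting
    return (ratio_sample / ratio_reference - 1.0) * 1000.0


# Hypothetical heavy/light isotope ratios (illustrative only)
r_reference = 0.0110
r_sample = 0.0108
print(f"delta = {delta_per_mil(r_sample, r_reference):.1f} per mil")
```

A sample with a lower heavy-to-light ratio than the reference gives a negative δ value, and a sample identical to the reference gives exactly zero.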

| ACKNOWLEDGING DIAGENESIS AND SELECTING A PRETREATMENT
The potential for variability in burial environment (e.g., humidity, pH, microbial attack, temperature and time) to alter the in vivo isotope values of bone collagen, bone bioapatite, dentine collagen, and tooth enamel bioapatite has been documented in archaeological applications since the 1980s. 17–20 Furthermore, the mechanisms behind these changes, especially for tooth enamel bioapatite and bone collagen, are relatively well known. 5,18,21 While this has led to basic diagenetic checks that are applied in many archaeological science publications (
Diagenesis is an important problem to address. However, it is also important to note that although studies might document diagenetic alteration of isotopic ratios in certain archaeological materials and contexts (e.g. 5 ), this does not mean that the same materials should not be analyzed in other settings. 25–27 This has been a particular problem in stable carbon and oxygen isotope studies of tooth enamel, which have only recently begun to expand in archaeology following a period of distrust. Opinions of the utility of different materials for stable isotope analysis need therefore to be formed on the basis of the wider, current literature rather than a few papers and preconceptions.
It is also important for researchers and reviewers to probe the impacts that different pretreatment protocols might have on a sample's material structure and δ value. 3,23,28 A plethora of different pretreatment techniques are still being used for the isotope analysis of bone collagen, bone bioapatite, tooth enamel bioapatite, tooth dentine, and plant remains, impacting the reliability of cross-comparison (e.g. 28 ). In many cases, such as for tooth enamel bioapatite and bone collagen, the effects of different techniques are minimal. 3,29 However, choices of pretreatment for bone bioapatite, 3,28 shell and soil carbonate, 30 and crop remains 23 can be more significant in attempts to replicate the in vivo δ value. In these latter cases, it is not unreasonable for a reviewer to request information regarding sample pretreatment and sample preparation if this has not already been provided.
Diagenetic and pretreatment biases for compound-specific approaches remain relatively under-explored and un-reported in the context of archaeological science. For the stable isotope analysis of proteinogenic amino acids, similar measures to those employed in bulk isotope analyses of bone collagen are used to assess diagenesis (Table 1). When evaluating amino acid δ 15 N data, there should be only minor differences between proline (Pro) and hydroxyproline (Hyp) δ 15 N values in collagen since the nitrogen in Hyp derives from Pro. 31 Similarly, the expected slope for the δ 13 C Hyp value as a function of the δ 13 C Pro value is 1 because, after formation of immature collagen, Hyp is synthesized exclusively from Pro. 32
This procedure may induce isotope effects on certain amino acids as reported previously in the literature. 39 For this reason, it is recommended that each lab investigates whether their cleaning protocols have isotope effects in order to correct for these offsets.
For lipids from artefacts, bone, and sediments, the normal practice is to screen the samples with gas chromatography (GC) and gas chromatography mass spectrometry (GC/MS) to verify the compounds of interest and check for co-elution and contamination (Table 1).
Contaminants, such as from plastic sample bags, should be reported and accounted for. Negative controls involving blank extractions and extractions of associated sediments and pottery should be routinely performed. Similarly, amino acid chromatograms should be checked for co-elution from non-proteinogenic compounds.

| SAMPLING
Problems of diagenesis and taphonomy (e.g. funeral practices) are also the primary cause of an inevitable issue in archaeological science stable isotope applications: low or variable sample size. The nature of archaeological preservation makes it very difficult to be prescriptive in sample size necessity in different isotopic applications.
However, it is important that both the variability within a given sample (e.g. a single bone) and the variability within the population under study (e.g. a sample of the same skeletal element from a series of different individuals) are taken into account when interpretations relating to δ differences between groups are being made. Furthermore, if a confident difference is to be asserted, the sample size must be large enough for the chosen statistical test to operate effectively.
In order to determine variability within a given type of sample for a particular study we recommend a pilot study measuring at least three repeat aliquots that are extracted and pretreated separately from a single sample (e.g. a bone). The measurement standard deviation for these extracts will provide a useful evaluation of δ uncertainty resulting from burial environment, pretreatment, and natural heterogeneity in a sample. It is especially important to do this when an archaeological material is being analyzed for the first time
FIGURE 2 Comparison of published collagen and dentin proline and hydroxyproline δ 13 C values from archaeological materials shows that there are only minor offsets between these amino acids (y = 0.62 + 1.03x, adjusted R 2 = 0.9252, F(1,184) = 2290, P < 0.001). The data were obtained from five studies that used LC/IRMS. 34
reported, even if only in supplementary information. 10 We recommend that these records of accuracy be kept up-to-date and used as an ongoing measure of laboratory performance that can even be publicly displayed.
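The pilot-study check on separately prepared aliquots recommended in this section can be sketched as follows. The function name and the three δ values are hypothetical; the only element taken from the text is the minimum of three aliquots that are extracted and pretreated separately from a single sample.

```python
from statistics import mean, stdev


def replicate_summary(delta_values):
    """Mean and sample standard deviation (1 sigma) of separately prepared aliquots."""
    if len(delta_values) < 3:
        raise ValueError("at least three separately prepared aliquots are recommended")
    return mean(delta_values), stdev(delta_values)


# Hypothetical delta values (per mil) for three aliquots of one bone
aliquots = [-19.8, -20.1, -19.9]
avg, sd = replicate_summary(aliquots)
print(f"mean = {avg:.2f} per mil, 1 sigma = {sd:.2f} per mil")
```

The resulting standard deviation reflects within-sample variability and preparation effects together, so it would normally be larger than the precision obtained from repeat measurements of a single extract.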
We would recommend using more than one check SRM to test whether this magnitude is similar across slightly different materials. Precision measured through repeat analyses is not necessarily the best overall measure of uncertainty in sample δ values, however. 49 A better measure is obtained by propagating errors across the whole process of sample selection, preparation, measurement, and normalization. This can be done using a 'bottom up' approach (e.g. 49,50 ) or via a 'top down' approach. 13 Together, these criteria enable archaeological science-focused laboratories to validate their pretreatment and instrumental methods more widely, the importance of which has recently been emphasized in forensic studies. 13 Given the growing importance of data compatibility between laboratories, and over long time scales for archaeological interpretation between populations and sites, it is essential that each study and laboratory demonstrate that its methods of sample selection, pretreatment/ extraction, measurement, and calibration meet accepted criteria.
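As a simplified illustration of what a 'bottom up' propagation involves, the sketch below combines separate standard uncertainties as a root sum of squares. This assumes that the components are independent and expressed in the same unit (per mil); it is not the specific procedure of the cited studies. The component names and magnitudes are hypothetical.

```python
import math


def combined_standard_uncertainty(components):
    """Root-sum-of-squares of independent standard uncertainties (all in per mil)."""
    if any(u < 0 for u in components.values()):
        raise ValueError("standard uncertainties must be non-negative")
    return math.sqrt(sum(u ** 2 for u in components.values()))


# Hypothetical 1 sigma contributions in per mil (illustrative only)
budget = {
    "sample_heterogeneity": 0.20,
    "pretreatment": 0.15,
    "measurement": 0.10,
    "normalization": 0.05,
}
print(f"combined uncertainty = {combined_standard_uncertainty(budget):.2f} per mil")
```

Listing the components explicitly also shows which stage dominates the budget, which is often more informative than the combined figure alone.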
In the case of a given study, we recommend the analysis of an SRM with a known isotope δ value and treating it alongside samples to establish the degree to which treatment and measurement cause sample δ values to deviate from their 'true' value. This is particularly important in single-compound approaches. For example, in Figure 4 we show the results of δ 13 C analyses of fatty acids repeatedly extracted from homogenized pulverized pottery sherds to assess within-lab reproducibility. The uncertainty achieved through this exercise (±0.6 ‰ 1σ) was greater than that obtained from repeated measurements of a single extract (typically ±0.3 ‰ 1σ), emphasizing the importance of propagating errors beyond measurement precision and the importance of extracting an SRM 'sample' during extraction. It is also particularly important when reporting this procedure to account for the addition of C and H during the derivatization process. As a result, the isotope ratios of reference compounds should be reported before and after derivatization with measurement uncertainties.
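One common way to account for carbon added during derivatization is a mass-balance correction, sketched below for δ 13 C. This is an illustrative assumption rather than a procedure specified in this article: it presumes that the derivatizing reagent adds a known number of carbon atoms with a known δ 13 C value and that derivatization itself introduces no isotope effect. All names and numbers are hypothetical.

```python
import math


def correct_for_added_carbon(delta_derivative, u_derivative,
                             delta_reagent, u_reagent,
                             n_carbon_analyte, n_carbon_added):
    """Mass-balance estimate of the underivatized d13C value and its 1 sigma uncertainty.

    Assumes n_total * d_derivative = n_analyte * d_analyte + n_added * d_reagent.
    """
    n_total = n_carbon_analyte + n_carbon_added
    delta_analyte = (n_total * delta_derivative
                     - n_carbon_added * delta_reagent) / n_carbon_analyte
    # First-order propagation for independent uncertainties
    u_analyte = math.sqrt((n_total * u_derivative) ** 2
                          + (n_carbon_added * u_reagent) ** 2) / n_carbon_analyte
    return delta_analyte, u_analyte


# Hypothetical example: a 16-carbon fatty acid with one carbon added by the reagent
d, u = correct_for_added_carbon(-28.0, 0.3, -40.0, 0.2, 16, 1)
print(f"corrected d13C = {d:.2f} +/- {u:.2f} per mil")
```

Because the derivative's uncertainty is multiplied by the ratio of total to analyte carbon atoms, the corrected value is always somewhat less certain than the measured one, which is one reason to report values before and after derivatization.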
In the case of inter-laboratory validation, inter-laboratory comparisons should be designed to take into account the points raised above relating to uncertainty arising from pretreatment, calibration, and standard use, in order to enable laboratories to critically identify the largest sources of errors during sample cleaning, extraction, and
isotope measurements also provide method information that could be practically followed by any researcher seeking to emulate the study, these data are often missing when isotope data are provided as supplementary information to radiocarbon dating. δ 13 C and δ 15 N values are often presented for bone collagen samples that have been radiocarbon dated, but information relating to their preparation, normalization, and comparison with reference materials should also be provided so that it can be determined whether these results are useful for subsequent stable isotope comparisons.
Full data reporting is also essential in publications. This is, in part, because it is useful to have datasets available for comparison with existing literature, particularly given the rise of 'big data' approaches within archaeology. 53 One of the most common oversights in this regard arises when a mean and standard deviation plot is produced for a given human or faunal group as a useful, simplified graphical representation of the data 'average' and 'spread', but the corresponding data table also reports only the mean and standard deviation for that group without an additional
54 If Suess effect corrections are made, these need to be explicit and include appropriate uncertainties. In general, all calculation and correction stages applied to raw data should be documented in a publication and its associated tables for transparent evaluation.
Where groups are compared, full archaeological context information, and the logic behind such groupings, should be provided.
Transparent methodology is also often missing in many compound-specific isotope papers, hindering replication and application beyond select groups. Given the rapid development of these specialist approaches and the ample space available in supplementary information sections this should be remedied. There is also a growing problem of data reporting in more recently applied compound-specific approaches.
Here studies must provide the final δ value for each sample (mean and analytical standard deviation) across replicate injections or include chromatograms of representative samples.
One potential solution to these problems is a central repository for archaeological isotopic data, and calls for this have been put forward. [55][56][57] Central data repositories, such as GenBank, exist for the field of archaeogenetics and proteomics. 58,59 However, it is also important to notice that applications of stable isotope data are extremely diverse and different research fields have specific data requirements. Partnership-based initiatives, such as IsoMemo, attempt to bring together multiple repositories of stable isotope data from archaeology, ecology, and environmental sciences. 60 The goals of the initiative are to coordinate data collection efforts, sharing and centralization of data, creating tools (e.g. user-friendly graphical interfaces) for facilitating data access, building interdisciplinary projects, and establishing common data standards.
The latter includes, for instance, adoption of common terminologies and the assignment of unique codes to stable isotope labs and for
There can be interpretive issues with the use of axis scales in terms of both scale length and scale divisions in bi-plots of two isotope parameters, the most common being δ 13 C values plotted against δ 15 N values for bone collagen, dentine collagen, or crop remains. Variation in ‰ in a given human or ecological system may be greater for δ 15 N than δ 13 C values, and vice versa. δ 15 N, δ 13 C, and δ 18 O values will also vary significantly between sites, periods, and ecological systems. These factors may lead to decisions to adopt different scale lengths and divisions for different isotopic parameters or archaeological periods and sites (Figure 5A). However, if this is done, researchers must be wary of statements such as: "Figure 5A shows that the variation in δ 15 N values is greater or the same as for δ 13 C values" or "Figure 5A shows large variation in δ 13 C values".
Similarly, in Figure 5B, two sequential plots of δ 18
It is important that all data points are displayed in graphs where possible, even if only in the supplementary information, and not just a summary representation (e.g. a boxplot or mean and standard deviation plot). This is a particularly significant problem with the growing use of ellipses to summarize archaeological δ parameters for a particular data group. These can be drawn using different methods and principles that should be clearly outlined. Furthermore, the confidence level of the ellipse should be stated. In single-compound analysis of fatty acids in archaeological pottery, for example, the archaeological data are commonly plotted against 68 % confidence ellipses of modern foods corrected for the Suess effect (e.g. 66 ). As can be seen in Figure 6, a confidence ellipse of 50 % and one of 95 % can have very different relationships to the real spread of the data. This is particularly sensitive for the relatively small sample sizes that are frequent in archaeological research.
overlap and can be stated as such. However, a "significant" difference is a term limited to statistical analysis and must be used accordingly.
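To illustrate why the confidence level of an ellipse needs to be stated, the sketch below computes the semi-axes and area of an ellipse drawn from the sample covariance of two δ parameters, scaled by the chi-square quantile for two degrees of freedom. This is one of several possible construction methods and is an assumption here: it treats the data as bivariate normal and the sample covariance as known, which is optimistic for small archaeological datasets. The function name and the δ values are hypothetical.

```python
import math
from statistics import mean


def confidence_ellipse(x, y, level):
    """Return (semi_major, semi_minor, area) of a bivariate-normal ellipse at `level`."""
    if not 0.0 < level < 1.0:
        raise ValueError("level must lie strictly between 0 and 1")
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least three paired observations")
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x) / (n - 1)
    syy = sum((b - my) ** 2 for b in y) / (n - 1)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    # Eigenvalues of the 2x2 covariance matrix give the variances along the ellipse axes
    half_trace = (sxx + syy) / 2.0
    root = math.sqrt(((sxx - syy) / 2.0) ** 2 + sxy ** 2)
    lam_major = half_trace + root
    lam_minor = max(half_trace - root, 0.0)
    # Chi-square quantile with 2 degrees of freedom has the closed form -2 ln(1 - p)
    scale = -2.0 * math.log(1.0 - level)
    semi_major = math.sqrt(scale * lam_major)
    semi_minor = math.sqrt(scale * lam_minor)
    return semi_major, semi_minor, math.pi * semi_major * semi_minor


# Hypothetical d13C and d15N values (per mil) for a small group
d13c = [-20.5, -19.8, -20.1, -19.2, -18.9, -19.6]
d15n = [9.1, 10.2, 9.8, 11.0, 11.4, 10.1]
for level in (0.50, 0.68, 0.95):
    a, b, area = confidence_ellipse(d13c, d15n, level)
    print(f"{level:.0%} ellipse: semi-axes {a:.2f} x {b:.2f}, area {area:.2f}")
```

Under these assumptions the 95 % ellipse covers roughly 4.3 times the area of the 50 % ellipse for the same data, so an ellipse without a stated level cannot be meaningfully compared with another.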
It is also important that the appropriateness of the chosen statistical test is discussed. 69
76 that the total number of observations must be significantly greater than the number of independent variables (>5:1), and that the number of observations in the smallest group must be greater than the number of independent variables.
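The two sample-size conditions stated above can be checked before an analysis is run. The sketch below is a minimal illustration; the function name and the group sizes are hypothetical, and the 5:1 ratio is taken directly from the text.

```python
def sample_size_adequate(group_sizes, n_independent_variables, min_ratio=5.0):
    """Check the two conditions quoted in the text for group-based multivariate tests.

    1. Total observations exceed min_ratio times the number of independent variables.
    2. The smallest group has more observations than there are independent variables.
    """
    if not group_sizes or n_independent_variables < 1:
        raise ValueError("need at least one group and one independent variable")
    total_ok = sum(group_sizes) > min_ratio * n_independent_variables
    smallest_ok = min(group_sizes) > n_independent_variables
    return total_ok and smallest_ok


# Hypothetical groups of individuals compared on two isotope parameters
print(sample_size_adequate([12, 9, 7], n_independent_variables=2))
print(sample_size_adequate([20, 2], n_independent_variables=2))
```

In the second call the total number of observations is sufficient, but the smallest group is not larger than the number of variables, so the check fails.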

| CONCLUSIONS
This review is not meant as a criticism directed towards any particular group of researchers, and indeed the authors have certainly been guilty of some of the above-listed failings at some point. In the interest of best scientific practice, we have attempted to summarize the basic requirements of terminology, sampling, measurement, reporting, display, and analysis in the presentation and publication of isotope data. Much of this is dictated by IUPAC, 10 and influential members within geochemistry, mass spectrometry, and archaeological science itself have leveled similar criticisms. However, as an increasing number of students and researchers from different academic backgrounds enter archaeological science, it is beneficial to circulate these requirements widely within an archaeology-focused format that is as accessible as possible to this readership (including by making this article Open Access).
We have focused on bulk stable carbon and nitrogen isotope analysis of organics; bulk stable carbon and oxygen isotope analysis of carbonates; single-compound stable carbon and nitrogen isotope analysis on amino acids in collagen and keratin; and single-compound stable carbon and hydrogen isotope analysis on fatty acids. However, with developments in archaeological science applications, the same principles of terminology, sampling, calibration, quality assurance and control, graphical representation, and statistical analysis will apply to the currently rarer, or more experimental, applications of bulk organic sulfur, hydrogen, and oxygen stable isotope analyses and stable hydrogen isotope analysis of amino acids.
We request that reviewers agreeing to evaluate publications involving isotope analysis familiarize themselves with these requirements, making the appropriate critique and suggestions where necessary, so as to raise the standard of science, data production, and data availability in this ever-expanding and advancing field.