Recently, two new proxies based on the distribution of glycerol dialkyl glycerol tetraethers (GDGTs) were proposed: the TEX86 proxy for sea surface temperature reconstructions and the BIT index for reconstructing soil organic matter input to the ocean. In this round robin study, fifteen laboratories analyzed two sediment extracts spanning a range of TEX86 and BIT values to test the analytical repeatability and reproducibility of these proxies. For TEX86 the repeatability, indicating intralaboratory variation, was 0.028 and 0.017 for the two sediment extracts, or ±1–2°C when translated to temperature. The reproducibility, indicating among-laboratory variation, of TEX86 measurements was substantially higher, i.e., 0.050 and 0.067, or ±3–4°C when translated to temperature. The latter values are higher than those obtained in round robin studies of the Mg/Ca and U37K′ paleothermometers, suggesting that the primary need is to improve compatibility between labs. The repeatability of BIT measurements for the sediment with substantial amounts of soil organic matter input was relatively small, 0.029, but the reproducibility was large, 0.410. This large variance could not be attributed to specific equipment used or a particular data treatment. We suggest that it may be caused by the large difference in molecular weight between the GDGTs used in the BIT index, i.e., crenarchaeol versus the branched GDGTs. Potentially, this difference gives rise to variable responses in the different mass spectrometers used. Calibration using authentic standards is needed to establish compatibility between labs performing BIT measurements.
 Reconstruction of ancient seawater temperatures is of considerable importance in understanding past climate changes. Over the last decades several temperature proxies have been developed and used to reconstruct past seawater temperatures on the basis of inorganic or organic fossil remains. Two of the most popular tools are presently the Mg/Ca ratio of planktonic foraminifera [Nürnberg et al., 1996; Elderfield and Ganssen, 2000] and the U37K' ratio based on long-chain C37 alkenones derived from haptophyte algae [Brassell et al., 1986; Prahl and Wakeham, 1987].
 Recently, a second organic seawater temperature proxy based on archaeal glycerol dibiphytanyl glycerol tetraether (GDGT) lipids, the TEX86, was proposed [Schouten et al., 2002]. These lipids are biosynthesized by marine Crenarchaeota which are ubiquitous in marine environments and are among the dominant prokaryotes in today's oceans [Karner et al., 2001; Herndl et al., 2005]. Marine Crenarchaeota biosynthesize different types of GDGTs, i.e., GDGTs containing 0 to 3 cyclopentyl moieties (GDGT-0 to GDGT-3; see structures in Figure 1) and crenarchaeol which, in addition to four cyclopentyl moieties, has a cyclohexyl moiety (GDGT-4). Finally, they also biosynthesize small quantities of a crenarchaeol regio-isomer (GDGT-4'). A study of marine surface sediments showed that higher overlying sea surface temperatures result in an increase in the relative amounts of GDGTs with two or more cyclopentyl moieties. The TEX86 ratio was proposed as a means to quantify the relative abundance of GDGTs [Schouten et al., 2002]:
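Using the GDGT numbering of Figure 1, the ratio defined by Schouten et al. [2002] can be written as follows, where square brackets denote the relative abundance of each compound (in practice, its integrated peak area):

```latex
\mathrm{TEX}_{86} =
  \frac{[\text{GDGT-2}] + [\text{GDGT-3}] + [\text{GDGT-4}']}
       {[\text{GDGT-1}] + [\text{GDGT-2}] + [\text{GDGT-3}] + [\text{GDGT-4}']}
```

GDGT-0 and crenarchaeol (GDGT-4) do not enter the ratio, so TEX86 depends only on the less abundant GDGTs.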
The TEX86 has recently been calibrated with annual mean sea surface temperature using marine sediment core tops with the following resulting equation [Kim et al., 2008]:
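As published by Kim et al. [2008], the core-top calibration is linear in TEX86, with T the annual mean sea surface temperature in °C:

```latex
T = -10.78 + 56.2 \times \mathrm{TEX}_{86}
```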
Studies have shown that this proxy can be analyzed in a range of sediments up to 120 My old and applied to the reconstruction of ancient sea surface water temperatures [e.g., Schouten et al., 2003; Forster et al., 2007]. TEX86 values in modern sediments range typically from 0.3 to 0.7 [e.g., Kim et al., 2008], while in ancient sediments they can be as high as 0.96 [e.g., Forster et al., 2007].
In addition to archaeal GDGTs, bacterial GDGTs with nonisoprenoidal carbon skeletons are also frequently encountered in marine sediments (GDGT-I to GDGT-III, Figure 1). Several studies have now shown that they are especially abundant in soils and peats [Weijers et al., 2006] and progressively decrease in concentration from coastal sediments to open marine sediments, suggesting a terrestrial origin [Hopmans et al., 2004; Herfort et al., 2006; Kim et al., 2006]. Hopmans et al. [2004] proposed the BIT index to quantify the relative abundance of these bacterial GDGTs versus crenarchaeol as a proxy for the input of terrestrial organic matter into marine sediments:
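With the branched GDGTs numbered I to III and crenarchaeol denoted GDGT-4 as in Figure 1, the index of Hopmans et al. [2004] can be written as:

```latex
\mathrm{BIT} =
  \frac{[\text{GDGT-I}] + [\text{GDGT-II}] + [\text{GDGT-III}]}
       {[\text{GDGT-I}] + [\text{GDGT-II}] + [\text{GDGT-III}] + [\text{GDGT-4}]}
```

The index therefore ranges from 0 (no branched GDGTs) to 1 (no crenarchaeol).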
A prerequisite for the wider application of these proxies is the robustness and analytical reproducibility of their analysis. This is especially important for these proxies because they are analyzed by high-performance liquid chromatography (HPLC) coupled to mass spectrometry (MS) [Hopmans et al., 2000; Schouten et al., 2007; Escala et al., 2007], a technique that was, until recently, not commonly used in many paleoceanographic and organic geochemical laboratories. A common procedure to establish the robustness and reproducibility of an analytical method is a round robin study, as has been done for the U37K′ ratio of long-chain C37 alkenones [Rosell-Melé et al., 2001] and for the Mg/Ca ratio of (foraminiferal) carbonates [Rosenthal et al., 2004; Greaves et al., 2008]. To assess the reproducibility of the HPLC/MS technique for TEX86 and BIT analysis, we performed an anonymous round robin study on filtered polar fractions obtained from extracts of two sediments, following the general outline and methods of the previous paleoceanographic proxy round robin studies by Rosell-Melé et al. [2001] and Rosenthal et al. [2004].
2. Materials and Methods
A general invitation to participate in an anonymous round robin study was sent to a large number of laboratories, of which 21 responded positively. To assess systematic errors in TEX86 and BIT analysis, these labs received two vials, labeled S1 and S2, each containing 1 mg of a polar fraction of a sediment extract prepared at the NIOZ Royal Netherlands Institute for Sea Research. Labs were requested to analyze the samples when their HPLC/MS setup was performing well according to their own criteria and to inject sufficient amounts to be above the limit of quantification [cf. Schouten et al., 2007]. The vials were distributed by the end of August 2007, and the results reported here are those of the fifteen labs that reported their results before 1 January 2008. One lab (14) reported its results for S1 after this deadline. These results are included in Tables 1–4 but are not considered further in this study.
Table 1. HPLC/MS Methods Used by Participants in the Round Robin Study
Abbreviations: Hex, hexane; IPA, isopropanol; SIM, selected ion monitoring; TOF, time of flight; na, not applicable.
The standards comprised filtered polar fractions of sediment extracts labeled S1 and S2. Sediment S1 was derived from a piston core taken in the Drammensfjord, Norway (D2-H; 59 40.11 N, 10 23.76 E; water depth 113 m; sediment depth 746–797 cm). Sediment S2 was derived from a gravity core (TY92–310G; 16 03 N, 52 71 E; water depth 880 m; sediment depth 0–42 cm) taken in the Arabian Sea. These two sediments were chosen because they were expected to cover a large range of TEX86 (temperate versus tropical) and BIT (coastal versus open ocean) values.
The sediments were freeze-dried and Soxhlet extracted for 24 h using a mixture of dichloromethane (DCM) and methanol (7:1, v/v). The combined extracts were separated over a column filled with aluminum oxide into an apolar and a polar fraction using hexane:DCM (9:1, v/v) and DCM:methanol (1:1, v/v), respectively. The resulting pooled polar fraction was concentrated by rotary evaporation and further dried under a stream of nitrogen. The polar fraction was weighed and dissolved in hexane:isopropanol (99:1, v/v) at a concentration of 2 mg/ml. Aliquots of 1 mg were filtered using a PTFE 0.4 μm filter, dried under a stream of nitrogen, and distributed to the different labs.
2.2. TEX86 and BIT Analysis
All labs used HPLC/Atmospheric Pressure Chemical Ionization (APCI)/MS to analyze GDGTs. The HPLC methods used by the different labs are listed in Table 1 and generally followed that of Schouten et al., i.e., a cyano column with a hexane-isopropanol gradient as the mobile phase. Injected sample sizes ranged from 3 to 300 μg of filtered polar fraction. Base peak chromatograms of HPLC/MS analyses of S1 and S2 are shown in Figure 1.
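As an illustration of how the two indices follow from the integrated peak areas of a single HPLC/MS run, a minimal sketch is given below. It assumes that TEX86 is the ratio of GDGT-2, GDGT-3 and GDGT-4' to the sum of those three plus GDGT-1 [Schouten et al., 2002] and that BIT is the ratio of the branched GDGTs I to III to the sum of those three plus crenarchaeol [Hopmans et al., 2004]. The dictionary keys and the peak areas are hypothetical and do not correspond to any participant's data.

```python
def tex86(areas):
    """TEX86 from integrated peak areas (arbitrary but consistent units)."""
    numerator = areas["GDGT-2"] + areas["GDGT-3"] + areas["GDGT-4'"]
    denominator = areas["GDGT-1"] + numerator
    if denominator <= 0:
        raise ValueError("peak areas must sum to a positive value")
    return numerator / denominator


def bit_index(areas):
    """BIT index from integrated peak areas (arbitrary but consistent units)."""
    branched = areas["GDGT-I"] + areas["GDGT-II"] + areas["GDGT-III"]
    total = branched + areas["GDGT-4"]  # GDGT-4 is crenarchaeol
    if total <= 0:
        raise ValueError("peak areas must sum to a positive value")
    return branched / total


# Hypothetical peak areas, for illustration only.
example_areas = {
    "GDGT-1": 10.0, "GDGT-2": 20.0, "GDGT-3": 5.0, "GDGT-4'": 5.0,
    "GDGT-4": 60.0, "GDGT-I": 10.0, "GDGT-II": 20.0, "GDGT-III": 10.0,
}
print(tex86(example_areas), bit_index(example_areas))
```

Because both indices are ratios of areas from the same run, they are insensitive to the absolute amount injected, provided all peaks are above the limit of quantification.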
2.3. Statistical Analysis
 Statistical analysis was based on the international standard ISO 5725 for interlaboratory tests [International Organization for Standardization, 1986]. Repeatability (r) and reproducibility (R) values were estimated. The repeatability r should be interpreted as the value below which the difference between two single test results obtained by the same method on identical test material under the same test conditions (same operator, same apparatus, same laboratory and within a short interval of time) may be expected to lie with a probability of 95%. The reproducibility R should be interpreted as the value below which the difference between two single test results obtained by the same method on identical test material but under different test conditions (different operators, different apparatus, different laboratory and not necessarily within a short interval of time) may be expected to lie with a probability of 95%. Under these definitions, all laboratories are considered to be using the “same method,” and R refers to interlaboratory results, while r refers to intralaboratory results. Outlying data and labs were detected by visual inspection of normal probability plots of laboratory means, chi-square probability plots of laboratory variances and Bartlett's test for homogeneity of variances.
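The definitions of r and R above can be sketched numerically as follows. The sketch assumes a balanced design (the same number of replicates in every lab), a one-way variance decomposition into within-lab and between-lab components, and the factor 2.8 (approximately 1.96 × √2) conventionally used with ISO 5725 to turn a standard deviation into a 95% limit on the difference between two results. The function name and the data are hypothetical, and the calculation is not necessarily identical in every detail to the one used in this study.

```python
from math import sqrt
from statistics import mean, variance


def precision_limits(lab_results, factor=2.8):
    """Return (r, R) for a dict mapping lab id -> list of replicate results.

    Assumes a balanced design: every lab reports the same number of replicates.
    """
    replicate_counts = {len(values) for values in lab_results.values()}
    if len(replicate_counts) != 1:
        raise ValueError("balanced design required in this sketch")
    n = replicate_counts.pop()
    if n < 2 or len(lab_results) < 2:
        raise ValueError("need at least 2 labs with at least 2 replicates each")

    # Repeatability variance: average of the within-lab sample variances.
    within_var = mean([variance(values) for values in lab_results.values()])

    # Between-lab variance: variance of lab means minus the share explained
    # by within-lab scatter; clipped at zero when the estimate is negative.
    lab_means = [mean(values) for values in lab_results.values()]
    between_var = max(variance(lab_means) - within_var / n, 0.0)

    # Reproducibility variance is the sum of both components.
    reproducibility_var = within_var + between_var
    return factor * sqrt(within_var), factor * sqrt(reproducibility_var)


# Hypothetical duplicate measurements from three labs.
example = {"A": [0.50, 0.52], "B": [0.60, 0.62], "C": [0.55, 0.57]}
r, R = precision_limits(example)
print(r, R)
```

By construction R can never be smaller than r, since the reproducibility variance contains the repeatability variance as one of its two terms.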
3. Results and Discussion
The results discussed here for the anonymous round robin study of two sediment extracts, labeled S1 and S2, are based on the fifteen labs that reported their results before the deadline of 1 January 2008. The results of the TEX86 and BIT analyses of the different labs are listed in Tables 2 and 3 and plotted in Figures 2 and 3, while the methods used are summarized in Table 1. All labs used almost identical LC conditions (solvent gradients, column type) but a variety of mass spectrometry techniques: eight labs used quadrupole MS, six labs used ion trap MS, and one lab used time-of-flight (TOF) MS. Note that most labs analyzed the samples within 1–2 days, and thus the standard deviations listed do not represent long-term reproducibility. Furthermore, since labs received “ready-to-inject” polar fractions, the results do not allow evaluation of the effects of individual sample work up procedures, as was done for the U37K′ ratio of long-chain C37 alkenones [Rosell-Melé et al., 2001] and for the Mg/Ca ratio of (foraminiferal) carbonates [Rosenthal et al., 2004].
3.1. TEX86 Analysis
The results of the TEX86 analysis are listed in Table 2 and shown in Figure 2. Figure 4a shows the distribution of TEX86 values for both samples S1 and S2. The results have an approximately Gaussian distribution, with a broader range for sample S1. We then statistically identified (see section 2.3) four outliers for S1 (labs 4, 5, 7, and 19) and one outlier for S2 (lab 5), which were removed from subsequent statistical treatment. These anomalous results cannot be attributed to a particular mass spectrometric technique, since the outliers were from two labs using a quadrupole MS and two labs using an ion trap MS (Table 1 and Figure 2).
The estimated repeatability for TEX86, after removal of the outliers, was 0.028 and 0.017 for S1 and S2, respectively (Table 5). The reproducibility, in contrast, was slightly higher for S2 (0.067) than for S1 (0.050). Note, however, that the variance estimate for S1 was made after removal of four outliers; removal of only the most severe outlier (lab 5) would have resulted in a reproducibility of 0.092. If we convert these TEX86 values to temperatures [Kim et al., 2008], then the repeatability of TEX86 analysis corresponds to 1.9 and 1.1°C for S1 and S2, respectively, while the reproducibility corresponds to 3.3 and 4.5°C for S1 and S2, respectively (Table 5). The better repeatability of sample S2 and, when the number of outliers removed from S1 is taken into account, its better reproducibility are likely due to the higher abundances of the minor GDGTs (GDGT-1 to GDGT-3 and GDGT-4') relative to GDGT-0 and crenarchaeol (GDGT-4). This is likely to have enabled a more reliable quantification, as amounts were not only above the limit of detection but also above the limit of quantification, which is likely to be an order of magnitude higher for TEX86 analysis [cf. Schouten et al., 2007].
To investigate potential causes for outliers and differences between labs, we plotted the TEX86 values of S1 against those of S2 (Figure 5a). This reveals a general tendency toward systematic differences between labs: for example, labs that are outliers in TEX86 measurements of S1 also tend to be outliers in TEX86 measurements of S2. This suggests that the differences between labs are not caused by inhomogeneity between individual vials of the standards. Another potential cause for the differences may be the “integration style” used, i.e., the criteria used to define peak starts and ends. This can be important because coelutions occur between the GDGTs of interest and other minor isomers. Therefore, labs were asked to reintegrate the peak areas in their chromatograms according to a prescribed format, preferably by a person not aware of the previous results. Twelve labs reported the results of this exercise, which showed that, with only a few exceptions, the changes in TEX86 were relatively minor (Table 4) and unlikely to account for the observed differences. The results of lab 5 are outliers for both samples. Examination of this lab's LC/MS equipment revealed that the cone of the nebulizer was not well aligned with the ion source and that the inner surfaces of the interface carried some chemical residues. This highlights that maintenance of the APCI interface is of prime importance to obtain consistent and robust results.
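The reasoning behind the S1 versus S2 cross plot can be sketched numerically: if the lab means for the two samples are strongly correlated, the scatter reflects lab-specific bias rather than random differences between vials. The sketch below computes a Pearson correlation coefficient for this purpose. The function name and the lab means are hypothetical, and the correlation coefficient is used here only as an illustration; the study itself relied on visual inspection of Figure 5a.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(x) != len(y) or len(x) < 3:
        raise ValueError("need two sequences of equal length (at least 3)")
    mean_x, mean_y = mean(x), mean(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation undefined for constant data")
    return cross / sqrt(spread_x * spread_y)


# Hypothetical lab means: one value per lab, same lab order in both lists.
s1_means = [0.38, 0.41, 0.43, 0.45, 0.47]
s2_means = [0.66, 0.69, 0.70, 0.73, 0.74]
lab_bias_correlation = pearson_r(s1_means, s2_means)
print(lab_bias_correlation)
```

A coefficient close to 1 would indicate that a lab reading high on one sample also reads high on the other, which is the pattern expected for systematic, lab-specific offsets.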
The results obtained for TEX86 analysis compare reasonably well with those obtained for other paleothermometers, especially considering the relatively recent development of the proxy. Rosell-Melé et al. [2001] found for U37K′ analyses of several sediments a repeatability of 1.6°C, but their reproducibility of 2.1°C was substantially better than that obtained in our study. Rosenthal et al. [2004] reported a repeatability of 1–2°C and a reproducibility of 2–3°C for Mg/Ca analysis of foraminifera, numbers that are also similar to those of our study. These estimates already include biases induced by work up procedures, which do not apply to our study. In fact, the reproducibility of standard mixtures, which does not include biases from sample work up, is even better at 0.5 and 1.3°C for Mg/Ca and U37K′, respectively. Thus, our interlaboratory study suggests that the repeatability (r) of TEX86 temperatures is similar to that of other paleothermometers but that the reproducibility (R) among labs is substantially higher, i.e., poorer. Hence, there is a need to improve reproducibility between labs using standards or calibrations. It should also be noted, however, that a large number of the participating labs had relatively little experience in analyzing GDGTs using HPLC/APCI/MS at this point. Presumably, the robustness of these analyses will improve with increasing experience.
3.2. BIT Analysis
The results of the analysis of samples S1 and S2 for BIT are displayed in Table 3 and Figures 3 and 4. Sample S2 is from an open marine sediment with a small contribution of soil organic matter, and thus values are nearly all below 0.1 (Figure 3b and Table 3). On the basis of Bartlett's test, four outliers were removed (labs 5, 11, 15, and 20), but the variances remained inhomogeneous even after their removal. The repeatability was 0.004, while the reproducibility was much larger at 0.028. Sample S1 is from a Norwegian fjord and likely contains substantial amounts of soil organic carbon [e.g., Huguet et al., 2007]. Indeed, substantially higher BIT indices were measured for this sample than for S2. However, we found a large spread in BIT values, ranging from 0.25 to 0.82 (on a scale from 0 to 1; Figure 3a and Table 3), and a broad nonuniform distribution (Figure 4b), quite different from that observed for the TEX86 measurements. For BIT measurements of S1, the repeatability estimate is 0.029, while the reproducibility estimate is high at 0.410, even after removing three outliers on the basis of Bartlett's test (Table 5).
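For readers unfamiliar with Bartlett's test, the sketch below computes its test statistic for the replicate measurements of several labs. Under the hypothesis of equal within-lab variances, the statistic approximately follows a chi-square distribution with k − 1 degrees of freedom for k labs, so a value above the corresponding critical value points to inhomogeneous variances. The function name and data are hypothetical; the critical value must be looked up separately, and the sketch does not reproduce the outlier removal procedure used in this study.

```python
from math import log
from statistics import variance


def bartlett_statistic(groups):
    """Bartlett's test statistic for homogeneity of variances.

    groups: list of lists, one list of replicate results per lab.
    Every lab needs at least 2 replicates and a nonzero variance,
    because the statistic uses the logarithm of each variance.
    """
    k = len(groups)
    if k < 2 or any(len(g) < 2 for g in groups):
        raise ValueError("need at least 2 labs with at least 2 replicates each")
    sizes = [len(g) for g in groups]
    variances = [variance(g) for g in groups]
    if any(v <= 0 for v in variances):
        raise ValueError("all within-lab variances must be positive")

    total = sum(sizes)
    # Pooled variance, weighted by the degrees of freedom of each lab.
    pooled = sum((n - 1) * v for n, v in zip(sizes, variances)) / (total - k)

    numerator = (total - k) * log(pooled) - sum(
        (n - 1) * log(v) for n, v in zip(sizes, variances)
    )
    correction = 1 + (sum(1 / (n - 1) for n in sizes) - 1 / (total - k)) / (3 * (k - 1))
    return numerator / correction


# Hypothetical triplicate results from two labs with different scatter.
print(bartlett_statistic([[1.0, 2.0, 3.0], [1.0, 3.0, 5.0]]))
```

When all labs have identical variances, the statistic is zero; it grows as the within-lab variances diverge.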
Table 5. Summary Statistics of All Measurements Made by the Different Laboratories
The large reproducibility estimate for sample S1 and the inhomogeneity in variances between different BIT measurements are striking. They suggest that the BIT index can be determined fairly repeatably within most labs but that there are considerable differences between labs. This points to a major underlying problem in determining the BIT index that is not apparent for TEX86 analysis, even though both parameters are measured in a single analysis. A reintegration exercise similar to that for the TEX86 measurements was performed for the BIT measurements, but again this did not result in substantial changes in the reported results (Table 4). Plotting the results of BIT measurements of S1 against those of S2 shows that the differences are systematic (Figure 5b) and thus again cannot be due to inhomogeneity between the distributed vials. Furthermore, there is no particular distinction in BIT values based on the type of mass spectrometer used (Figure 3), nor do the laboratories form clusters similar to those found for the TEX86 results (Figure 5a).
There may be several reasons for this large spread in BIT indices. First, branched GDGTs have a later elution time. Most chromatographic programs made use of a hexane-isopropanol gradient and thus, depending on the elution time, a different percentage of isopropanol may have been present in the APCI chamber during the ionization of the branched GDGTs than during the ionization of crenarchaeol (GDGT-4). This may have given rise to differences in the ionization efficiency of the GDGTs in the APCI, and thus variation in the BIT index may depend on the chromatographic behavior of the GDGTs on the LC column. However, at the NIOZ lab similar BIT values were obtained for S1 despite variations in retention time of up to 5 min or when using an isocratic elution program, suggesting that varying isopropanol concentrations do not have a major effect on the indices measured. Second, and likely more importantly, there is a large mass difference between branched GDGTs (m/z 1022–1050) and crenarchaeol (m/z 1292). Thus, the BIT index will be more affected by the mass calibration and tuning of the mass spectrometer used, in contrast to the TEX86, where mass differences of the GDGTs used are much smaller (m/z 1300 to m/z 1292). This difference does not depend on the type of mass spectrometer (Figure 3). To solve this problem unequivocally, mixtures of authentic standards of crenarchaeol and a branched GDGT in known ratios are required, something which needs to be considered in future round robin studies. Until then, it is clear that the BIT index can only be used as a crude qualitative measure for the relative input of soil organic matter in coastal systems. The results also have consequences for assessing biases in TEX86 using an absolute BIT value [cf. Weijers et al., 2006]. Instead, it may be possible to assess this bias by correlating BIT values with TEX86 values, i.e., large changes in soil organic matter input, and thus in the BIT index, will likely lead to changes in the TEX86.
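The suggested effect of mass-dependent instrument response can be illustrated with a simple model. The sketch assumes that a given instrument detects the branched GDGTs with a single multiplicative response factor relative to crenarchaeol; this factor, the function name, and the numerical values are hypothetical and were not measured in this study.

```python
def apparent_bit(true_bit, response_ratio):
    """BIT value reported by an instrument with a mass-dependent response.

    true_bit: BIT index that an unbiased instrument would report (0 to 1).
    response_ratio: assumed detector response for branched GDGTs relative
        to crenarchaeol (1.0 means no bias); a hypothetical parameter.
    """
    if not 0.0 <= true_bit <= 1.0 or response_ratio <= 0:
        raise ValueError("true_bit must be in [0, 1], response_ratio positive")
    branched_signal = response_ratio * true_bit
    crenarchaeol_signal = 1.0 - true_bit
    return branched_signal / (branched_signal + crenarchaeol_signal)


# Spread of apparent BIT for two hypothetical samples when the response
# ratio varies between one third and three across instruments.
for true_bit in (0.50, 0.05):
    low = apparent_bit(true_bit, 1.0 / 3.0)
    high = apparent_bit(true_bit, 3.0)
    print(true_bit, low, high, high - low)
```

Under this assumption, the same range of response ratios produces a much larger absolute spread for a sample with an intermediate BIT value than for one with a BIT value close to zero, which is qualitatively the pattern observed for S1 and S2. A model of this kind also indicates why mixtures of authentic standards in known ratios would allow the response ratio of each instrument to be determined and corrected.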
An anonymous interlaboratory study of TEX86 and BIT analysis of two sediment extracts was carried out by fifteen laboratories around the world and revealed relatively large variances between the labs, especially for BIT analysis. Repeatability of TEX86 analysis was, in terms of temperature, similar to the work up and analytical repeatability of other paleothermometers (±1–2°C), but the reproducibility between labs was larger (±3–4°C), indicating the need for improved analytical protocols. Given the generally good within-laboratory repeatability, paleotemperature reconstructions based on TEX86 are therefore likely to perform as well as other proxies for determining magnitudes and rates of climatic changes; the poor reproducibility will only affect the reconstruction of absolute temperatures. For BIT values the reproducibility was large (0.410), potentially because of differences in mass calibration and tuning of the mass spectrometers used. Our results suggest that there is a clear need for further round robin studies, which should include the use of mixtures of authentic standards, constraints on the effects of mass calibration and tuning setups, evaluation of sample work up procedures, and monitoring of long-term reproducibility.