Statistical modelling of measurement error in wet chemistry soil data

There is a growing demand for high‐quality soil data. However, soil measurements are subject to many error sources. We aimed to quantify uncertainties in synthetic and real‐world wet chemistry soil data through a linear mixed‐effects model, including batch and laboratory effects. The use of synthetic data allowed us to investigate how accurately the model parameters were estimated for various experimental measurement designs, whereas the real‐world case served to explore if estimates of the random effect variances were still accurate for unbalanced datasets with few replicates. The variance estimates for synthetic pH-H2O data were unbiased, but limited laboratory information led to imprecise estimates. The same was observed for unbalanced synthetic datasets, where 20, 50 and 80% of the data were removed randomly. Removal led to a sharp increase of the interquartile range (IQR) of the variance estimates for the batch effect and the residual. The model was also fitted to real‐world pH-H2O and total organic carbon (TOC) data, provided by the Wageningen Evaluating Programmes for Analytical Laboratories (WEPAL). For pH-H2O, the model yielded unbiased estimates with relatively small IQRs. However, the limited number of batches with replicate measurements (5.8%) caused the batch effect to be larger than expected. A strong negative correlation between batch effect and residual variance suggested that the model could not distinguish well between these two random effects. For TOC, the batch effect was removed from the model as no replicates were available within batches. Again, unbiased model estimates were obtained. However, the IQRs were relatively large, which could be attributed to the smaller dataset with only a single replicate measurement. Our findings demonstrated the importance of experimental measurement design and replicate measurements in the quantification of uncertainties in wet chemistry soil data.

• Linear mixed-effects models can be used as a tool to quantify uncertainty in wet chemistry soil data.
• Lack of replicate measurements leads to poor estimates of error variance components.
• Measurement error in wet chemistry soil data should not be ignored.
KEYWORDS: accuracy, experimental design, linear mixed-effects model, replicate measurements, soil chemical data, uncertainty

1 | INTRODUCTION

A soil system's physical and chemical properties are commonly determined by the collection and subsequent wet chemistry analysis of soil samples. The results from wet chemistry measurements can be further used to, for instance, develop soil spectroscopy models (McBratney, Minasny, & Rossel, 2006) or estimate soil organic carbon stocks (Smith et al., 2020). Accurate and reliable analytical data of relevant soil properties are key to achieve accurate calibration and validation of such models (Dangal, Sanderman, Wills, & Ramirez-Lopez, 2019). The need for high-quality soil data is widely acknowledged by organizations such as the FAO's Global Soil Partnership, which established the Global Soil Laboratory Network (GLOSOLAN) in 2017 (FAO, 2019a). GLOSOLAN aims to build laboratory capacity and improve the provision of reliable and comparable soil data by harmonizing methods, units, data and information.
During the measurement process, errors can occur from field sampling (e.g., Goidts, Van Wesemael, & Crucifix, 2009), sample handling, sample transport, shipment preparation, taking a subsample for laboratory analysis and the laboratory analysis itself (Van Ee, Blume, & Starks, 1990). Errors can also occur after the laboratory analysis, for example, during data processing. Other post-analysis errors are model error, present in, for instance, pedotransfer functions and spectral models (Libohova et al., 2019), and interpolation error, included in digital soil mapping. In this study, we focused on the error that occurs during the laboratory analysis: the laboratory measurement error. Factors that often contribute to measurement error are the analyst, complex wet chemistry methodologies, varying measurement conditions (e.g., temperature and humidity), a variety of different sample preparation methods and the measurement instrument itself (Allchin, 2001; Libohova et al., 2019; Viscarra Rossel & McBratney, 1998). Error in soil measurements may be defined as the difference between the 'true' value of a soil property and its measured value (Hibbert, 2007). GLOSOLAN aims to reduce such method-related errors through harmonization of standard operating procedures (SOPs) for commonly used wet chemistry methods and the development of quality assurance and quality control programmes. In this research, we aimed to quantify the uncertainty associated with defined analytical methods, building upon the need for high-quality soil data.
Errors in wet chemistry soil data can propagate in further applications, such as pedotransfer functions, spectral models and digital soil mapping (Heuvelink, 2018; McBratney, Minasny, Cattle, & Vervoort, 2002). Over time, the prediction accuracy of such models has improved significantly, making the relative contribution of measurement error in the calibration data larger. For example, visible (Vis), near-infrared (NIR) and mid-infrared (MIR) diffuse reflectance spectroscopy models have improved rapidly due to advances in computation, instrument manufacturing and multivariate statistics (Dangal et al., 2019; Guerrero, Viscarra Rossel, & Mouazen, 2010). Furthermore, developments in computation and statistics helped to extract useful information from the measured spectra (Viscarra Rossel et al., 2016).
The 'true' value of a soil property must be known to quantify errors in soil measurements. However, we rarely have this knowledge and therefore we are uncertain about the 'true' value. For instance, suppose a laboratory clay content measurement yields a value of 24.2%, while the 'true' clay content is 27.5%. The error is 3.3%, but we do not know it because we have only the measurement. Because we are aware that there may be a measurement error, we are uncertain about the 'true' clay content. Heuvelink, Brown, and van Loon (2007) defined uncertainty as an expression of confidence in our knowledge of the 'true' value of a specific soil property. Several methods have been developed to deal with uncertain data. Commonly, uncertainties are quantified through probability distribution functions (PDFs) (Heuvelink, 2018; Heuvelink et al., 2007). This approach allows for a complete characterization of the uncertainty, including correlations between uncertainties in soil measurements. Just as soil observations can be dependent on each other, so can the uncertainties associated with them. Furthermore, PDFs can easily be implemented in the uncertainty propagation and stochastic sensitivity analysis of environmental models (Malone, McBratney, & Minasny, 2011). Despite the availability of multiple methods, uncertainty estimates are rarely specified by providers of wet chemistry soil data, meaning that the user has little to no knowledge about the quality, or uncertainty, of these data. Furthermore, measures of the uncertainty associated with compilations of such analytical datasets are seldom provided, with the exception of the WoSIS database (Batjes, Ribeiro, & van Oostrum, 2020). Hence, there is a need to quantify uncertainties in wet chemistry soil data and store detailed uncertainty information in soil databases.
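As a concrete illustration of the PDF approach, a single measurement can be represented by a normal distribution centred on the measured value. The sketch below reuses the clay example above; the standard deviation of 2.0% is an assumed value for illustration, not one reported by any laboratory.

```python
# Represent a single clay-content measurement as a normal PDF.
# The measured value (24.2%) plus an assumed standard deviation
# fully define the distribution.
measured = 24.2   # laboratory result (% clay)
sigma = 2.0       # assumed measurement standard deviation (% clay)

# 95% probability interval for the 'true' clay content
lower = measured - 1.96 * sigma
upper = measured + 1.96 * sigma
print(f"95% interval: [{lower:.2f}, {upper:.2f}] % clay")
# → 95% interval: [20.28, 28.12] % clay
```

Note that, in this hypothetical example, the interval does cover the (normally unknown) 'true' value of 27.5%, which is exactly what a well-calibrated uncertainty statement should achieve most of the time.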
In this research, we aimed to model uncertainties in synthetic and real-world wet chemistry soil pH-H2O and total organic carbon (TOC) measurements through PDFs. For this, we assumed that we can represent uncertainties by normal distributions. To estimate the parameters of these distributions, that is, the variances of the error components, we applied a linear mixed-effects model approach. The use of balanced and unbalanced synthetic data allowed us to investigate how well the model parameters (i.e., the variances of the error components) can be estimated given a specific experimental measurement design. We aimed to provide guidance on the 'best', or recommended, experimental measurement design for accurate representation of the laboratory measurement error through PDFs, based on the results of the synthetic case. Furthermore, the model was applied to real-world data provided by the Wageningen Evaluating Programmes for Analytical Laboratories (WEPAL) (http://www.wepal.nl). This real-world case study served to quantify the contribution of multiple error sources to the overall measurement uncertainty and to explore whether the error source variance estimates were still accurate for such unbalanced designs with only a few replicate measurements.

2 | ERROR IN ANALYTICAL CHEMISTRY
Analytical measurements are subject to various error sources, such as the analyst and the instrument. Because of error, a measurement on a soil sample for a given property is considered to be only an approximation of the 'true' value. To assess the quality of a measurement, we wish to know the size of the error. In the case of repeated measurements, the measurement error can be divided into a systematic and a random component. The systematic error will be the same for all repeated measurements, whereas the random error may vary between replicates. Each measurement is the sum of the 'true' value, a systematic error and a random error:

Y_i = X + S + ε_i,    i = 1, …, n,    (1)

where Y_i is the i-th measurement, X is the 'true' value of the soil property and n refers to the total number of measurements performed on the soil sample. The measured value differs from the 'true' value by a systematic component, S, and a random component, ε_i, which differs for each measurement (Figure 1). In this research, we assume that the random error follows a normal distribution with zero mean and standard deviation σ. Because it has zero mean, it is theoretically possible to eliminate the random error by measuring the same sample an infinite number of times (Theodorsson, Magnusson, & Leito, 2014). The systematic error in Equation (1) affects all results by the same amount (i.e., we assume S to be constant). The systematic error can also be proportional to the 'true' value (Ramsey, 1998). In that case, S would depend on X (e.g., S = c·X, where c is a constant).
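A minimal simulation makes the behaviour of the two error components in Equation (1) concrete: averaging many replicates suppresses the random error, but the systematic error survives in the mean. The values of X, S and σ below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

X = 6.5        # 'true' value of the soil property (hypothetical)
S = 0.15       # systematic error, identical for every replicate (assumed)
sigma = 0.2    # standard deviation of the random error (assumed)
n = 100_000    # number of repeated measurements

# Equation (1): Y_i = X + S + eps_i
Y = X + S + rng.normal(0.0, sigma, size=n)

# Averaging many replicates removes the random error (zero mean),
# but the systematic error S remains in the mean.
print(Y.mean())   # close to X + S = 6.65, not to X = 6.5
```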
In analytical chemistry, the error in measurement results is typically addressed by applying a method validation scheme (Hibbert, 2007). Commonly used terms in method validation schemes are trueness, precision and accuracy. Trueness is defined as the closeness of agreement between the average value of a large number of measurement results and an accepted reference value (International Organization for Standardization [ISO], 1994). In other words, trueness is the systematic difference between the 'true' value, X, and the mean value of a large number of measurements, represented by S in Equation (1). When the trueness of measurement results is low, this means that there is a large systematic error, or bias.

FIGURE 1 Systematic (S) and random error (with standard deviation σ) in a measurement (Y). Each measurement result deviates from the 'true' value, X, because of error. Y refers to the average of a large number of measurements on the same sample. Figure adapted from Hibbert (2007).
In method validation schemes, the random error, ε, characterizes the precision of repeated measurements. The ISO (1994) defines precision as the closeness of agreement between independent test results that were obtained under stipulated conditions. Precision includes terms such as repeatability and reproducibility (Ebentier et al., 2013; Ellison, Barwick, & Farrant, 2009). Repeatability refers to the best precision a single laboratory can obtain and can be calculated from repeated measurements under comparable conditions, that is, the same analyst, on the same day, using the same instruments. Reproducibility is a measure of agreement between repeated measurements on the same samples, obtained by the same method, under different conditions. Measurement conditions can differ within and between laboratories, resulting in within- and between-laboratory reproducibility (Ellison et al., 2009).
Both trueness and precision are used to determine the quality performance characteristic of the measurements: the accuracy (Menditto, Patriarca, & Magnusson, 2007). Accuracy can be defined as the closeness of agreement between a single measurement result and the accepted reference value, which is considered as the 'true' value (ISO, 1994). Thus, accuracy relates to the sum of S and ε_i in Equation (1). The accuracy of a result is used to determine the degree of confidence that one has in that particular measurement. A high accuracy can only be achieved when trueness is high, meaning that the systematic error (S) is small, and when the precision is high, meaning that the standard deviation, σ, of the random error is small.
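These definitions translate directly into simple estimators: the deviation of the replicate mean from the reference value estimates trueness (the bias S), while the replicate standard deviation estimates precision (σ). The reference value and error settings below are hypothetical, chosen only to show the calculation.

```python
import numpy as np

rng = np.random.default_rng(1)

X = 27.5      # accepted reference ('true') value, e.g. % clay (hypothetical)
S = -0.8      # systematic error (assumed)
sigma = 0.5   # random-error standard deviation (assumed)

# Simulated replicate measurements following Equation (1)
replicates = X + S + rng.normal(0.0, sigma, size=10_000)

bias = replicates.mean() - X        # trueness: estimates S
precision = replicates.std(ddof=1)  # precision: estimates sigma
print(bias, precision)              # near -0.8 and 0.5, respectively
```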
The systematic error can include multiple components, for instance, the matrix variation effect, batch bias, laboratory bias and method bias (Thompson, 2000). For measurements taken in the same batch, we can assume that conditions are more similar. Bias in a single batch will systematically affect all measurements in that particular batch because conditions are the same, such as the analyst and the instrument. The same applies to a laboratory. Each laboratory applies the same method but is likely to have its own interpretation of the method protocol, leading to a laboratory bias. Whenever multiple methods are applied to measure a certain soil property, each method has its own bias. In the next section we developed these error sources more formally using a statistical modelling approach.

| Linear mixed-effects model
We assume that errors in analytical data result from varying measurement conditions between batches and laboratories but also from method bias, as explained in Sections 1 and 2. We can include identified error sources in a linear mixed-effects model. Such a model partitions the total variance in the dataset into components and determines the relative contribution of each variance component (Zuur, Ieno, Walker, Saveliev, & Smith, 2009). The linear mixed-effects model includes both fixed and random effects as predictor variables. Fixed effects are group specific (i.e., the error of each group or component), whereas random effects inform on the variation between groups. Essentially, the systematic and random errors from Equation (1) are broken down into multiple components. Here, we assume that variance in measurements is associated with true soil variation (characterized by sample ID, a fixed effect), and the batch, laboratory and sample preparation method (random effects). Residual variance that cannot be attributed to these components is regarded as the random error.
We included these fixed and random effects in the following linear mixed-effects model:

Y_pqrst = X_s + B_pqr + L_q + M_r + ε_pqrst,    (2)

where Y_pqrst is the t-th measurement of the s-th soil sample in the p-th batch, performed by the q-th laboratory while using the r-th preparation method. B_pqr represents the random effect of the p-th batch in the q-th laboratory using the r-th preparation method, L_q represents the random effect of the q-th laboratory, M_r represents the random effect of the r-th preparation method, X_s is the 'true' value of a soil sample (modelled as a fixed effect) and ε_pqrst is the random error. Based on the theory of linear mixed-effects models, we assume that B_pqr, L_q, M_r and ε_pqrst are uncorrelated and normally distributed with zero mean, and standard deviations σ_batch, σ_laboratory, σ_method and σ_residual, respectively.
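Because the random terms in Equation (2) are assumed uncorrelated, their variances add. The short check below simulates each term for a single sample; the batch, laboratory and residual standard deviations match the synthetic case described later (0.1, 0.25 and 0.2), while the preparation-method standard deviation of 0.15 is purely hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000

# Independent draws of each random term in Equation (2)
B = rng.normal(0.0, 0.10, n)    # batch effect
L = rng.normal(0.0, 0.25, n)    # laboratory effect
M = rng.normal(0.0, 0.15, n)    # preparation-method effect (hypothetical sd)
eps = rng.normal(0.0, 0.20, n)  # residual error

X = 6.5                         # fixed 'true' value of one sample
Y = X + B + L + M + eps

# Uncorrelated effects: the variances of the components add up
expected = 0.10**2 + 0.25**2 + 0.15**2 + 0.20**2   # = 0.135
print(round(Y.var(), 3), expected)                 # empirical value near 0.135
```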

| Parameter estimation
In this research, the linear mixed-effects model was used to model measurement error in wet chemistry soil data. In an ideal world, wet chemistry soil datasets would be balanced and include replicate measurements for all soil samples. Having a balanced dataset structure means that the number of observations per combination of factor levels is equal. In other words, multi-laboratory soil data are balanced when each laboratory measured the same soil samples the same number of times, and included an equal number of samples in an equal number of batches. However, data delivered by multiple laboratories are seldom balanced, as the number of analyses is highly dependent on available material, time and financial resources. Furthermore, repeated measurements on soil samples are required to accurately quantify the random effects included in the model. The batch effect can only be estimated when the same soil sample is analysed in different batches. The same applies to the laboratory effect. Furthermore, sufficient replicate measurements per batch should be available to correctly estimate the residual variance. However, replicate measurements of all soil samples are costly and are often only collected for a small subset, for example, as part of the laboratory's quality control programme.
Therefore, the accuracy of model parameter estimates is dependent on the experimental measurement design, namely the number of laboratories and batches, and the number of replicates taken. A higher number of replicate measurements means that more information is available to accurately estimate the model parameters, in particular the residual variance. Whenever insufficient information is available, that is, in case of (near-)singularity, the model will become unstable and model parameter estimates will become highly uncertain, or cannot be estimated at all.
In the case studies of this research, restricted maximum likelihood (REML) was used to estimate the variance components of the random effects, as well as the fixed effects (Lark & Cullis, 2004;Webster & Oliver, 2007;Webster, Welham, Potts, & Oliver, 2006). For Gaussian models, REML is known to produce less biased parameter estimates when compared to maximum likelihood (ML) (Harrison et al., 2018).

| Software implementation
The linear mixed-effects model was fitted to synthetic data (Section 4) and real-world data (Section 5) in R (R Core Team, 2017), using the lmer function from the lme4 package (Bates, Mächler, Bolker, & Walker, 2015). In the lmer syntax, the model was expressed through a formula including both the fixed and random effects. Here, we assumed that hierarchy was present in the grouping of the observations by the grouping variables batch and laboratory. Every laboratory groups a number of unique batches, making batches nested within laboratory. All control arguments in the lmer function kept their default values. The R scripts of the synthetic case were made publicly available via GitLab (van Leeuwen, Mulder, Batjes, & Heuvelink, 2021).

| Data
As explained in Section 3, we can distinguish between many different error sources, which result from varying measurement conditions between batches and laboratories. However, we would also like to quantify these errors. To do so, an experimental measurement design is required that allows reliable estimation of the variance associated with each error source. In this section, we used synthetic datasets to estimate these variances. Using synthetic data allowed us to compare various experimental measurement designs. Furthermore, because we were in control of generating the data, we knew the variances associated with all error sources. Therefore, we could evaluate how well they were estimated from the data.
Five synthetic datasets were generated, consisting of n = 20, 50, 100, 200 and 500 sample IDs. The datasets included synthetic 'true' pH-H2O values for each sample ID, which were drawn from a normal distribution with a mean of 6.5 and a standard deviation of 1 pH unit. Each soil sample was 'measured' in duplicate, distributed over four batches. This set-up was repeated for three hypothetical laboratories. In the case of n = 100 sample IDs, this experimental measurement design resulted in 600 (100 × 3 × 2) synthetic pH-H2O measurements in total, distributed over 12 batches (Figure 2). The batch effect was taken to have a variance of 0.01 (σ_batch = 0.1), whereas the laboratory effect had a variance of 0.0625 (σ_laboratory = 0.25). The residual variance was taken as 0.04 (σ_residual = 0.2). Two thousand repetitions were made for the five datasets to study the effect of sample size and experimental measurement design on the estimation of the variance components. The 'true' pH-H2O of a soil sample, as well as the random effects, were simulated anew for each of these 2,000 repetitions. The lmer function was applied to each of the 2,000 repetitions, after which the estimated variance components were stored. We thus obtained 2,000 estimates of each variance component and could compare these with the 'true' variances used to simulate the data. Note that we did not generate data with different preparation methods. Therefore, M_r was excluded from the model in Equation (2). All simulations from normal distributions were performed using the rnorm function in R.
To analyse the effect of missing data on model parameter estimation, 20, 50 and 80% of the data were randomly removed from the original datasets, using the sample function from the R base package. The effect of the various percentages of missing data was studied by comparing the interquartile ranges (IQR) of the model parameter distributions. We used the IQR instead of the standard deviation to evaluate the spread of the model parameter estimates, because some of the distributions were skewed. In such cases, the IQR is easier to interpret.
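For readers working in Python rather than R, the two operations described here (random removal without replacement, and the IQR as a spread measure) look as follows; the array of 'estimates' is a stand-in drawn from a normal distribution, not real model output.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy stand-in for 2,000 estimated variances from repeated model fits
estimates = rng.normal(0.04, 0.01, size=2_000)

# Randomly remove 50% of a 600-row dataset (mimicking the unbalanced designs)
data = np.arange(600)
kept = rng.choice(data, size=len(data) // 2, replace=False)

# IQR: preferred over the standard deviation for skewed distributions
q25, q75 = np.percentile(estimates, [25, 75])
iqr = q75 - q25
print(len(kept), round(iqr, 4))   # 300 rows kept; IQR near 1.349 * 0.01
```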

| Balanced data
The linear mixed-effects model was fitted to the five synthetic datasets (Table 1; Figure 3). For all sample sizes, the average of the estimated variances for the random effects was close to the 'true' values. This indicates that REML estimation of the variance components was unbiased. Some modest variation could be observed in the mean estimated laboratory effect variance, which ranged from 0.061 to 0.065. As expected, the IQR of the model parameter estimates decreased with increasing n, meaning that the estimates became more accurate as the size of the dataset increased (Table 1). However, this effect was limited for the batch and laboratory effects, with a decrease in IQR of 23.3 and 4.6%, respectively, when increasing from n = 20 to n = 500. The IQR of the residual variance estimates dropped by 80% between n = 20 and n = 500. Furthermore, the IQRs of the batch and laboratory effects were almost equal to the variance estimates themselves. For example, for n = 100, σ²_batch = 0.010, whereas the corresponding IQR was 0.0086. In contrast, the IQR of the residual variance estimates was quite small compared to the residual variance estimate itself. Figure 3 shows the density plots of the estimated model parameters. For the batch and laboratory effect variances, the distributions are right skewed. Furthermore, the laboratory effect variance shows a large mass at zero, meaning that approximately 11% of the laboratory variances were estimated to be zero. Therefore, the laboratory effect has a mixed discrete-continuous distribution (Weld & Leemis, 2019), indicated by the discrete spike with a solid disk in Figure 3. The probability density estimates for the residual variance were fairly symmetric. Their distribution became significantly narrower when more sample IDs were included in the dataset.

FIGURE 2 Nested structure of the synthetic dataset, here including three hypothetical laboratories, 12 batches and n = 100 sample IDs.
Correlations between estimates of individual random effects were close to zero (Appendix S1).

| Unbalanced data
After removing 20, 50 and 80% of the data, the means of the model parameter estimates over all 2,000 repetitions were again similar to the 'true' parameter values (results not shown). The IQRs of the estimates for the batch effect and residual variance clearly increased with a higher percentage of randomly removed data (Figure 4). The IQR of the laboratory effect variance behaved more erratically. In general, the IQRs for n = 20, 50, 100 and 500 increased when comparing 0 and 80% removed data. However, the IQR values at 20 and 50% removed data behaved randomly. Furthermore, the IQR values based on the n = 200 dataset fluctuated strongly, showing no distinct relationship between the IQR and the percentage of randomly removed data. Removing 80% of the data resulted in a larger IQR of the batch effect and residual variance for the n = 20 dataset compared to the n = 500 dataset. For example, the IQR of the batch effect variance of the n = 20 dataset increased by 171%, whereas the IQR of the n = 500 dataset increased by only 15%. The same was observed for the IQR of the residual variance. In both cases, the effect of missing data was largest for the smallest dataset, n = 20.

| Discussion
In this section, we explored the ability of the proposed model (Equation 2) to estimate the random effect variances, σ²_batch, σ²_laboratory and σ²_residual, under different experimental measurement designs. We expected that the width of the distribution of the estimated variances for each of the random effects would decrease with increasing n, as more data were available. The distributions of the estimated σ²_batch and σ²_laboratory were both right skewed, whereas that of σ²_residual was symmetrical. The mean estimates of the batch effect and residual variance were similar to the 'true' values, indicating unbiased model estimates for all n. The laboratory effect showed some variation in the mean variance estimates, ranging from 0.061 to 0.065, whereas the 'true' value was 0.0625. This minor variation around the 'true' value could be attributed to the finite number of repetitions (2,000) we performed. Increasing the number of repetitions, to, for example, 10,000, is likely to lead to more stable estimates.
The IQR of the estimated variances decreased with increasing n. However, this effect was only clear for the residual variance, where the IQR decreased by 80% when comparing n = 20 to n = 500. The large number of replicates available in the n = 500 dataset simply led to more precise residual variance estimates, shown by the small IQR. For the batch and laboratory effects, the decrease in IQR when increasing n from 20 to 500 was less marked, with a reduction of only 23.3 and 4.6%, respectively. The limited decrease of the IQR could be explained by the information available to estimate the random effects. For the batch effect, the datasets contained information on 12 batches, whereas for the laboratory effect, data from only three laboratories were included. In other words, increasing n did not increase the number of laboratories and batches. For example, for the n = 500 dataset, 500 (sample IDs) × 2 (replicates) × 3 (laboratories) yielded 3,000 measurements. However, within these data, groups of 1,000 measurements shared the same laboratory error, meaning that overall little information was available to estimate the laboratory effect. As only limited laboratory information was available, the model parameter estimates were less precise.

FIGURE 3 Density plots of the estimated batch, laboratory and residual variances for the five synthetic datasets, computed over 2,000 repetitions. The dashed line indicates the 'true' variance. As the laboratory effect variance had a mixed discrete-continuous distribution with a probability mass at zero, the discrete part is indicated by the black spike together with the probability mass represented by a solid disc. Note that the scales of the x- and y-axes differ.
After determining the random effects for balanced datasets, we explored how the model parameter estimates were influenced by different unbalanced experimental measurement designs. We expected that for each n, the IQR of the estimated variances for the random effects would increase with a higher percentage of missing data. Furthermore, we also expected that the IQR values would be lower and show a gentler increase in the case of a larger n. This effect could be explained by the fact that more observations remained for n = 500 after removing 80% of the data, compared to when the same percentage was removed from the n = 200 dataset. Our expectations were in line with the observed IQRs for the batch effect and residual variance, where the difference in IQR between 0 and 80% removed data was largest for n = 20 (171% increase in batch effect variance IQR). For comparison, the batch effect IQR for n = 500 increased by only 15% when 80% of the data were removed. The relation between the IQR and the size of n was less distinct for the laboratory effect. Randomly removing data from all datasets led to highly fluctuating IQRs. These fluctuations are likely to be reduced by increasing the number of repetitions, for example, from 2,000 to 10,000. Thus, they are a consequence of the approximation errors caused by the numerical evaluation of the IQRs. However, the size of the IQRs was, again, mainly influenced by the limited number of laboratories present in the datasets. Increasing the number of laboratories is likely to result in a smaller IQR of the laboratory effect, which would then show a clearer increase when removing data randomly.
The use of synthetic datasets to develop and test the model allowed us to study the influence of different experimental measurement designs on model parameter estimates. This application is especially interesting for laboratories that aim to quantify and reduce measurement error caused by the various error components. The model could be used by individual laboratories to determine the minimal number of replicate measurements to be included in the experimental measurement design. To illustrate this, consider a case with the batch and laboratory effects removed, so that all variance in the data is captured by the residual part of the model. In this simplified case, results may also be obtained analytically (and perhaps also for more complex cases, but numerical approaches may be more attractive in such instances). Because we assumed normal distributions for the errors, the variance estimator has a chi-squared distribution (Webster & Oliver, 2007, Section 2.6.5), from which the IQR can be derived using the qchisq function in R (van Leeuwen et al., 2021). Figure 5 shows the results of the analytical solution for synthetic pH-H2O data with σ_residual = 0.2, 0.3 and 0.4. For each σ_residual, the IQR was determined for the cases where 5, 10, 20, 50, 100, 200 and 500 replicates were available. The IQR decreased by between 26 and 29% whenever the number of replicates was doubled. In other words, the IQR decreased approximately inversely proportionally to the square root of the number of replicates included in the experimental measurement design.

FIGURE 4 Interquartile range (IQR) of estimates for the batch effect, laboratory effect and residual variance (n = 20, 50, 100, 200 and 500). For each n, 0, 20, 50 and 80% of the data were removed randomly before fitting the linear mixed-effects model (2,000 repetitions).
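The square-root behaviour can also be reproduced without the analytical qchisq route used in the paper: under the normality assumption, (n − 1)s²/σ² follows a chi-squared distribution with n − 1 degrees of freedom, so the IQR of the variance estimator can be approximated by Monte Carlo simulation. The sketch below (Python with numpy, an analogue of the R calculation, not the authors' script) checks the effect of doubling from 50 to 100 replicates for σ_residual = 0.2.

```python
import numpy as np

rng = np.random.default_rng(11)
sigma2 = 0.2 ** 2          # residual variance (sigma_residual = 0.2)

def var_estimator_iqr(n_rep, n_draws=400_000):
    """IQR of the sample-variance estimator for n_rep replicates.

    With normal errors, (n_rep - 1) * s^2 / sigma^2 follows a
    chi-squared distribution with n_rep - 1 degrees of freedom.
    """
    chi2 = rng.chisquare(n_rep - 1, size=n_draws)
    s2 = sigma2 * chi2 / (n_rep - 1)
    q25, q75 = np.percentile(s2, [25, 75])
    return q75 - q25

ratio = var_estimator_iqr(100) / var_estimator_iqr(50)
print(round(ratio, 2))     # close to 1/sqrt(2) ≈ 0.71: doubling the
                           # replicates shrinks the IQR by roughly 30%
```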
In future research, the effect of different numbers of batches and laboratories could be investigated. Furthermore, in the current model (Equation 2), we assumed that only σ_laboratory was systematically different for each laboratory. However, the batch and residual standard deviations may also vary between laboratories, meaning that each laboratory has its own σ_batch and σ_residual. Adding this effect would result in a more complex model with more parameters, which are more difficult to identify and hence lead to larger IQRs under the same experimental measurement design. Another effect that could be added to the model is a proportional residual variance. For instance, in soils with a high organic carbon content (e.g., forest soils), measurements tend to have a larger error. In this case, the residual variance is proportional to the TOC content of the sample.

| Data
For the real-world case study, measurements of soil samples from WEPAL were used. These data are part of the International Soil-analytical Exchange (ISE) programme (WEPAL, 2020). TOC measurements (dry combustion) were included in the analysis to determine if the model was also able to precisely estimate the variances of the random effects for a soil property other than pH-H2O. TOC is considered more difficult to measure than pH-H2O and its analysis is likely to yield more uncertain results (Bisutti, Hilke, & Raessler, 2004). Here, TOC is expressed as a mass percentage (%).
Soil samples were selected based on their generic soil characteristics, aiming for a diverse range of materials to be included in the analysis (Table 2). Furthermore, we required that each soil sample was included in at least two rounds of the ISE programme. This condition ensured that multiple measurements of the same soil sample by the same laboratory were available.
The pHH2O and TOC data contained batch and laboratory information. Furthermore, in the proficiency testing scheme, each homogenized soil sample was sent to participating laboratories over multiple rounds. Therefore, a round effect might be present, which was added as a random effect in the linear mixed-effects model. We expected the round variance to be close to zero, as WEPAL ensures the homogeneity of soil samples over rounds. The original model (Equation 2) was rewritten into:

Y_pqrst = μ_s + B_p + L_q + R_r + ε_pqrst, (3)

where the preparation method effect, M, was excluded and the round effect, R, was added. Y_pqrst is the t-th measurement of the s-th soil sample of the r-th round, measured in the p-th batch by the q-th laboratory. For pHH2O, the final dataset included measurements of all nine soil samples (see Table 2), which were analysed by 332 laboratories over a maximum of four rounds. The 6,749 measurement results were spread out over 6,380 batches, thus including 369 replicates. The TOC dataset contained only 1,756 observations, which were analysed by 154 different laboratories. Here, only one batch contained replicate measurements. Consequently, quantifying the batch effect was impossible, because the model could not distinguish between batch effect and residual variance. Therefore, the batch effect was removed from the model before fitting it to the TOC dataset (Equation 4):

Y_qrst = μ_s + L_q + R_r + ε_qrst. (4)

We also used the selected WEPAL data to test if the linear mixed-effects model (Equations 3 and 4) could accurately estimate the variances of the random effects for real-world, unbalanced data.
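To make the structure of this model concrete, the sketch below simulates data from it. The variance components are those later fitted to the pHH2O data, but the design (300 laboratories, four rounds, one batch per laboratory-round combination, two replicates per batch) and the 'true' sample value of 7.0 are assumptions for illustration only. Because the random effects are drawn independently, the total variance of the simulated measurements is simply the sum of the four components.

```python
import numpy as np

rng = np.random.default_rng(1)

# Variance components (the values fitted to the pHH2O data)
var = {"round": 0.0015, "batch": 0.047, "lab": 0.029, "resid": 0.036}

# Assumed illustrative design
n_lab, n_round, n_rep = 300, 4, 2
mu = 7.0   # assumed 'true' value of the soil sample

L = rng.normal(0.0, var["lab"] ** 0.5, n_lab)       # laboratory effects L_q
R = rng.normal(0.0, var["round"] ** 0.5, n_round)   # round effects R_r

y = []
for q in range(n_lab):
    for r in range(n_round):
        B = rng.normal(0.0, var["batch"] ** 0.5)    # batch effect B_p
        for _ in range(n_rep):
            e = rng.normal(0.0, var["resid"] ** 0.5)  # residual
            y.append(mu + B + L[q] + R[r] + e)
y = np.asarray(y)

total = sum(var.values())   # independent effects: variances add up
print(y.var(ddof=1), total)
```

Repeating such simulations and refitting the mixed-effects model to each simulated dataset is the basis of the simulation studies reported below.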

| pH-in-water
The linear mixed-effects model was fitted to the WEPAL pHH2O measurements. The model estimated σ²_round = 0.0015, σ²_batch = 0.047, σ²_laboratory = 0.029 and σ²_residual = 0.036. The round effect variance was close to zero. Furthermore, the batch effect was greater than the laboratory effect. The estimated residual variance was smaller than the batch variance, but larger than the laboratory effect variance. The fitted values, based on the WEPAL pHH2O measurements, were used as input, or 'true', values for the random effects in the simulation study. Figure 6 shows the distributions of the estimated model parameters, based on 500 repetitions. The variance estimates of all four parameters were fairly normally distributed and centred around the 'true' variances, which were based on the model estimates from the real WEPAL data. The IQRs of the estimated variances were 0.0006 (round effect), 0.0037 (batch effect), 0.0039 (laboratory effect) and 0.0035 (residual). For all random effects, the IQRs were relatively small, suggesting that the model was able to accurately estimate the random effects in WEPAL's pHH2O measurements.

| TOC
The linear mixed-effects model estimated σ²_round = 0.21%², σ²_laboratory = 7.71%² and σ²_residual = 32.90%². As for pHH2O, the round effect was negligibly small, but for TOC the residual variance was substantially larger than the laboratory effect variance. These 'true' values were again used in a simulation study (Figure 7). In 19.8% of the simulations, the round effect variance was estimated at exactly zero, indicating a mixed discrete-continuous distribution (Weld & Leemis, 2019). The IQRs of the estimated variances were 0.35%² (round effect), 1.80%² (laboratory effect) and 1.54%² (residual). These were relatively large compared to the estimated variances, especially for the round and laboratory effect.
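The point mass at zero arises because a variance component cannot be negative: when the between-round spread in a simulated dataset is no larger than expected from residual noise alone, the estimator returns exactly zero. A minimal sketch of this boundary effect is given below; it uses a method-of-moments (ANOVA-type) estimator truncated at zero rather than the REML fit used in the paper, and an assumed design (4 rounds, 25 observations per round), so the fraction of zeros it produces is only indicative.

```python
import numpy as np

rng = np.random.default_rng(7)

# Small round variance next to a large residual (values from the TOC fit)
sigma2_round, sigma2_resid = 0.21, 32.90   # in %^2
n_rounds, n_per_round, n_sims = 4, 25, 2000  # assumed illustrative design

zeros = 0
for _ in range(n_sims):
    r = rng.normal(0.0, sigma2_round ** 0.5, n_rounds)
    y = r[:, None] + rng.normal(0.0, sigma2_resid ** 0.5,
                                (n_rounds, n_per_round))
    msw = y.var(axis=1, ddof=1).mean()              # within-round mean square
    msb = n_per_round * y.mean(axis=1).var(ddof=1)  # between-round mean square
    est = max(0.0, (msb - msw) / n_per_round)       # truncated at zero
    zeros += est == 0.0

print(zeros / n_sims)   # fraction of simulations estimating exactly zero
```

With the true round variance this small relative to the residual, a large share of simulated datasets yield an estimate of exactly zero, producing the spike at zero seen in Figure 7.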

| Discussion
In Section 5.2, the model was fitted to WEPAL's pHH2O (Equation 3) and TOC (Equation 4) data. For both soil properties, the model estimated the round effect variance at almost zero (pHH2O: σ²_round = 0.0015; TOC: σ²_round = 0.21%²). These low values indicate that the variance as a result of heterogeneity of the sample material was minimal. In other words, the composition of the soil sample did not change over rounds, which was expected as WEPAL guarantees homogeneity of the provided test material. Furthermore, for pHH2O the model estimated that the batch effect variance was larger than that of the laboratory effect (σ²_batch = 0.047 and σ²_laboratory = 0.029). We expected the batch effect to be smaller than the laboratory effect, because laboratories use their own specific procedures and within-laboratory variability is typically smaller than between-laboratory variability. A batch effect larger than the laboratory effect could be related to the structure of the testing scheme WEPAL applies: every soil sample is distributed over the participating laboratories between one and four times. However, not every laboratory returns all measurement results. In most cases, only a single measurement result was returned per round per laboratory. Only 369 out of 6,380 batches (5.8%) contained a replicate measurement. As a result, the model may have had difficulty in distinguishing between the batch effect and the residual part of the model, leading to a higher batch effect variance and thus an underestimated residual variance. The results from the simulation study supported this assumption: batch effect and residual variance showed a strong negative correlation (−0.839), whereas correlations between the other random effects were close to zero (Figure 8). The TOC batch effect variance was captured within the residual variance, because the TOC dataset contained only one replicate measurement. Therefore, the batch effect was removed from the model.
Correlations between random effect estimates were all close to zero (Appendix S1).
The model parameter estimates of the 'true' values were in agreement with the WEPAL data. For pHH2O, Table 2 showed a mean variance of 0.103 (σ = 0.32), whereas the model estimated the total variance at 0.114 (σ = 0.34). Similarly, the TOC data had a mean variance of 38.88%² (σ = 6.24%), whereas the model estimated a total variance of 40.82%² (σ = 6.39%). Table 2 shows that the standard deviation for TOC differed substantially between sample IDs. For example, sample 866 had a mean TOC of 10.09% and a standard deviation of 2.71%, whereas sample 867 had a mean TOC of 51.08% with a standard deviation of 10.15%. With higher mean TOC, the standard deviation increased, indicating a proportional effect. Therefore, in future research, this effect may be included by assuming that the residual standard deviation is proportional to the mean.
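A quick way to inspect such a proportional effect is to compare relative rather than absolute standard deviations. The sketch below (our illustration, using only the two samples quoted above) computes the coefficient of variation (CV = σ/mean): the absolute standard deviations differ by a factor of almost four, while the CVs are much closer, which is the pattern a proportional-error model would capture.

```python
# Mean TOC (%) and standard deviation (%) for two WEPAL samples (Table 2)
samples = {866: (10.09, 2.71), 867: (51.08, 10.15)}

# Coefficient of variation: relative standard deviation per sample
cv = {sid: sd / mean for sid, (mean, sd) in samples.items()}
for sid, value in cv.items():
    print(sid, round(value, 3))
```

The absolute standard deviations differ by a factor of about 3.7, the CVs by only about 1.4.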
Results from the simulation study showed that the average model parameter estimates were close to the 'true' values. For pHH2O, the IQRs of the variance estimates were small, indicating that, despite the small number of replicate measurements (369 of 6,749 observations), the model was able to accurately estimate the random effects. In contrast, the relative IQRs for the model parameter estimates were larger for TOC, especially for the round and laboratory effect. This could be attributed to the larger laboratory bias in the TOC data compared to the pHH2O data. Furthermore, the TOC dataset was much smaller than that of pHH2O (1,756 and 6,749 observations, respectively), and included only a single replicate measurement, whereas the pHH2O dataset contained 369 replicates.

| GENERAL DISCUSSION

| Quantification of uncertainties in wet chemistry soil data

The quantification of uncertainties in wet chemistry data is of the utmost importance, as errors in these data can propagate in further applications, such as pedotransfer functions or spectroscopic models (FAO, 2019b). Unfortunately, the majority of wet chemistry soil data is provided without uncertainty estimates, thereby limiting the scope for users to assess the "fitness for intended use". Most databases that are compilations of legacy data, as well as spectrally derived data from various sources, do not provide quality indicators along with the data. In this study, we evaluated how measurement uncertainties can be estimated using replicates and a linear mixed-effects model approach.
F I G U R E 7 Density plots of model parameter estimates for the round and laboratory effects, and the residual variance, based on the simulated WEPAL TOC dataset (500 repetitions). As the round effect variance followed a mixed discrete-continuous distribution with a mass at zero, the discrete part is indicated by the black spike, with the probability mass represented by a solid disc. The dashed vertical line indicates the input, or 'true', variance of the random effects. Note that the scales of the x- and y-axes differ.

We quantified uncertainties in pH and TOC measurements through a linear mixed-effects model approach. In other disciplines, such as ecology and medicine, (generalized) linear mixed-effects models are already commonly used to estimate variances from multiple sources in repeated measurements (Zuur et al., 2009). However, for wet chemistry soil data, the measurement error is often expressed as the mean squared error (MSE) or the root mean square error (RMSE) (Lagacherie et al., 2019; Libohova et al., 2019). These indicators are merely summary measures, which do not provide information about the uncertainty associated with different components of the measurement procedure. Linear mixed-effects models allow determination of the contribution of each variance component to the total variance in the measurement. This information can help data providers decrease variance in their measurements by improving the limiting components in the laboratory analysis process itself.
Several studies have also determined the size of the measurement error in basic soil properties, such as pH and TOC (e.g., Laslett & McBratney, 1990; Libohova et al., 2019; Pribyl, 2010). However, different approaches, alternative terminologies for the error and the use of different units made it challenging to compare our findings with those of previous studies. Furthermore, one should pay attention to what is included in the estimate of the measurement error. For example, Rawlins, Lister, and Mackenzie (2002) mention that, in practice, subsample and analytical variance cannot be separated, because the analytical variance can only be estimated by repeated analyses of the same uniform soil sample. However, as soil is never totally uniform, part of the analytical variance should actually be attributed to the subsampling error. In this research, we did not include subsampling error as a separate random effect. However, if this error source was present in the data, it was captured within the residual variance. Libohova et al. (2019) assessed the size of errors in pH measurements, originating from the use of multiple wet chemistry methods, pedotransfer functions and spatial interpolation procedures, from different US databases. For pHH2O, Libohova et al. (2019) found a RMSE of 0.19 for the laboratory repeatability and 0.34 for the within-laboratory reproducibility. The between-laboratory reproducibility was estimated to have a RMSE of 0.50. Variance estimates were also provided for the laboratory repeatability and within-laboratory reproducibility: 0.15 and 0.06, respectively. Another study summarized analytical error variance estimates for a set of soil properties from the literature (Viscarra Rossel & McBratney, 1998). A mean variance of 0.0295 for pHH2O was obtained.

F I G U R E 8 Scatter plots and Pearson correlations of the random effects variance estimates for pHH2O
Both values are relatively small compared to the total variance estimated in the current research, where pHH2O had a mean total variance of 0.114 in the real-world case study. The data collected from the literature by Viscarra Rossel and McBratney (1998) originated from multiple laboratories, which all analysed the same soil sample. However, the study included only a small number of soil samples and variances were reported per soil sample. These examples from the literature show that comparison of measurement errors between studies is difficult, pointing to the need for a consistent procedure for quantifying the uncertainty in wet chemistry soil data.

| Implications for users of soil data
We applied the model to real-world soil data, provided by WEPAL. For the pHH2O data, the overall measurement error was estimated at σ = 0.34, whereas σ = 6.39% was estimated from the TOC data. These measurement errors are relatively large, especially for TOC.
For data users, these findings are important to determine the "fitness for intended use" of the data. Negative effects can occur when uncertain data are used, for example, to provide nutrient or fertilizer recommendations. In the United States, applications of lime to acidic soils to raise soil pH are based on a recommendation scheme that does not take the uncertainty in measured pH into account, potentially leading to additions that are either too low or too high (Libohova et al., 2019). Target pH values in such nutrient application schemes are categorized in steps of 0.2 pH units (Laboski, Peters, & Bundy, 2006). For these purposes, the estimated random effect standard deviations (σ_batch = 0.22, σ_laboratory = 0.17 and σ_residual = 0.19) could potentially lead to incorrect recommendations. TOC, a common measure for soil organic carbon (SOC), is often used to assess SOC stocks and carbon sequestration rates at multiple spatial scales (Francaviglia, Di Bene, Farina, Salvati, & Vicente-Vicente, 2019). For accurate estimates of carbon sequestration rates and SOC stock changes, precise and repeated measurements of SOC are required (Stockmann et al., 2013). Uncertainty in SOC measurements will lead to uncertainty in the SOC stock estimation. For the TOC data, the total standard deviation was estimated at 6.39% (mean TOC = 28.9%), which is a relative error of 22.1%. A relative error of 22.1% in a TOC measurement will thus lead to a relative error of 22.1% in the SOC stock estimate, in addition to the error arising from measurement error in bulk density and coarse fragments data.

| Recommendations for providers of soil data
We applied the model to real-world soil data, provided by WEPAL. Contrary to our expectations, the batch effect variance was larger than the laboratory effect variance for pHH2O. The model estimated σ²_batch = 0.047 and σ²_laboratory = 0.029. However, as explained in Section 5.3, the small number of batches with replicate measurements, and the strong negative correlation between the batch effect and residual variance, suggested that this experimental measurement design was not appropriate for accurately estimating the batch effect and separating it from the residual variance. Therefore, when quantifying the batch effect, more batches should contain replicate measurements.
The synthetic case study yielded insight into the influence of experimental measurement designs and the number of replicates on the accuracy with which error source components could be estimated. The number of replicates influenced the accuracy of the estimate of the residual variance, as illustrated by the IQRs (Table 1). For n = 20, the IQR of the residual variance was five times as large as the IQR for n = 500 (0.0080 and 0.0016, respectively). This change is not caused by the number of samples, but by the number of replicates, as was also demonstrated in Figure 5: the IQR was inversely proportional to the square root of the number of replicates (√(500/20) = √25 = 5). This effect was less distinct for the batch effect, where the IQR decreased from 0.0086 (n = 20) to 0.0066 (n = 500). The laboratory variance IQR remained more or less the same (0.0736 and 0.0702 for n = 20 and n = 500, respectively). As mentioned in Section 4.3, the IQRs of both the batch and laboratory effect variance did not decrease for larger n, as the total number of batches and laboratories remained the same. This is important to consider when using the model to assess variance in multi-laboratory datasets. The same applies to any other error source included in the model. Whenever a limited number of batches and laboratories are included in the dataset, the model estimates for these random effects will be less accurate.
The analytical results from the simplified model demonstrated the importance of including sufficient replicates in laboratory analyses (Section 4.3). A limited number of replicate measurements led to imprecise model parameter estimates, regardless of the number of samples. The analytical tool can be used by laboratories to determine how many replicates should be included in their experimental measurement design. Of course, this decision also depends on the financial resources available for the analyses. Considering variance estimation before performing analyses could greatly improve uncertainty quantification of wet chemistry measurements. These findings are relevant to initiatives such as GLOSOLAN (FAO, 2019a). The importance of having sufficient replicate measurements was also observed in the real-world case study. For pHH2O, only 5.8% of the batches included replicate measurements. The lack of sufficient replicates in each batch caused the model to have difficulty distinguishing between the batch effect and residual variance.
Furthermore, as demonstrated in the synthetic case study with unbalanced datasets, the number of sample IDs included in the analyses indirectly affected the model parameter estimates. Here, imprecise model parameter estimates were observed for the smaller datasets (n = 20 and n = 50) after 80% of the data was removed. In these smaller datasets, fewer replicates were available. When removing 80% of the existing data for n = 20, 96 of the total 120 measurements were removed. The remaining 24 observations contained too little information to accurately estimate the three random effects. In larger datasets, more data remained available to estimate the model parameters after removing a large percentage of the data.
Additionally, as suggested by the literature review, when quantifying uncertainties in wet chemistry soil data, SOPs for measuring and reporting data are of the utmost importance. Standardized methods will reduce σ_batch, σ_laboratory, σ_method and σ_residual. The positive effect of SOPs on the laboratory measurement error is illustrated by the results from the real-world case study. WEPAL applies SOPs for preparing homogeneous soil samples and the subsequent storage of the material, leading to a small round effect variance, as presented in Section 5.2.

| CONCLUSION
We aimed to quantify uncertainties in synthetic and real-world wet chemistry soil data through a linear mixed-effects model approach. Our study showed that for balanced and unbalanced datasets, using data for three hypothetical laboratories (four batches per laboratory), the mean estimated variances of the random effects were in agreement with those used to generate the synthetic datasets. In other words, there was no systematic bias in the model variance estimates.
The results from the synthetic case study also showed the importance of including sufficient batches and laboratories in the experimental measurement design when quantifying uncertainties in multi-laboratory data. Including data from only three hypothetical laboratories led to imprecise estimation of the laboratory random effect, which did not improve when more unique samples were added. A similar, although weaker, effect was observed for the batch effect variances. To solve this problem, it may be better to represent the laboratory effect as a fixed effect instead of a random effect when data come from only a few laboratories or batches. In contrast to the batch and laboratory variance estimates, the residual variance estimates did become more precise, as indicated by the decreasing IQRs. In the unbalanced case, the IQRs increased after various percentages of data were removed randomly. The effect was clearest for smaller datasets, where removing observations left little information for the model to estimate its parameters.
The real-world case study using WEPAL data showed that the model could also accurately estimate the variances of the random effects for real unbalanced datasets. For pHH2O, the model estimated σ²_batch = 0.047, σ²_laboratory = 0.029 and σ²_residual = 0.036. However, due to the small number of batches with replicate pHH2O measurements (5.8%), the model had difficulty distinguishing between batch and residual variance. For TOC, only a single replicate measurement was available, so the batch effect was removed from the model. The model estimated σ²_laboratory = 7.71%² and σ²_residual = 32.90%². For both pHH2O and TOC, the round effect variance was close to zero, indicating that WEPAL successfully distributes stable and homogeneous soil samples between rounds of the ISE programme.
The results from the synthetic case also demonstrated the importance of having sufficient replicate measurements. Therefore, in laboratory performance comparisons, a greater number of batches should contain replicate measurements to successfully estimate the variance components. The analytical result for the simplified case with only residual error (Figure 5) showed that the IQR of the variance estimates decreased in inverse proportion to the square root of the number of replicates included. Data providers can use these types of analysis to determine how many replicates should be included in their experimental measurement design, thus striking a balance between accurate measurement uncertainty quantification and financial resources.
This research demonstrated the importance of adequate experimental measurement design and sufficient replicate measurements in the quantification of uncertainties in wet chemistry soil data. Furthermore, the results showed that the laboratory measurement error in soil data is quite large and should not be ignored by the users of the data, which, unfortunately, occurs regularly.

ACKNOWLEDGEMENTS

We thank the two anonymous reviewers for their constructive and insightful comments.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available from the corresponding author upon reasonable request. Restrictions apply to the availability of the WEPAL data.