Diffuse reflectance spectroscopy for estimating soil properties: A technology for the 21st century

Spectroscopic measurements of soil samples are reliable because they are highly repeatable and reproducible. They characterise the samples' mineral-organic composition. Estimates of concentrations of soil constituents are inevitably less precise than estimates obtained conventionally by chemical analysis. But the cost of each spectroscopic estimate is at most one-tenth of the cost of a chemical determination. Spectroscopy is cost-effective when we need many data, despite the costs and errors of calibration. Soil spectroscopists understand the risks of over-fitting models to highly dimensional multivariate spectra and have command of the mathematical and statistical methods to avoid them. Machine learning has fast become an algorithmic alternative to statistical analysis for estimating concentrations of soil constituents from reflectance spectra. As with any modelling, we need judicious implementation of machine learning as it also carries the risk of over-fitting predictions to irrelevant elements of the spectra. To use the methods confidently, we need to validate the outcomes with appropriately sampled, independent data sets. Not all machine learning methods should be considered 'black boxes'. Their interpretability depends on the algorithm: some are highly interpretable and explainable, whereas others are difficult to interpret because of complex transformations or their huge and complicated network of parameters. But there is rapidly advancing research on explainable machine learning, and these methods are finding applications in soil science and spectroscopy. In many parts of the world, soil and environmental scientists recognise the merits of soil spectroscopy. They are building spectral libraries on which they can draw to localise the modelling and derive soil information for new projects within their domains. We hope our article gives readers a more balanced and optimistic perspective of soil spectroscopy and its future.


1 | INTRODUCTION
The recent Opinion article in this journal by McBride (2022) reviews the science underlying diffuse reflectance spectroscopy for determining concentrations of chemical constituents in soil. In the article, McBride criticises the techniques and the exaggerated claims made by some of its exponents. Some of the comments made by McBride are fair. The reflectance spectra of soil in the visible (vis; 400-700 nm), near-infrared (NIR; 700-2500 nm), and mid-infrared (MIR; 2500-20,000 nm) are influenced simultaneously by many soil components, each absorbing electromagnetic radiation at characteristic frequencies. Soil spectra are more complex than those of many other materials, making it more difficult to analyse and draw inferences from them. We should not readily expect to obtain accurate estimates of constituents that lack spectral response, which according to McBride, many practitioners seem to assume. Regression, whether based on ordinary least squares, more elaborate modelling or machine learning, is unlikely to provide reliable predictions of target variables that are only weakly correlated with the spectra. All modelling can be risky and suffer from over-fitting unless appropriately used and validated. Spectroscopy in these circumstances is no panacea; it is no certain substitute for well-established soil chemical, physical or biological analysis. We agree with McBride on these matters. With only a superficial review of the literature, we might find published studies that would have readers believe otherwise.
There are, however, important aspects of McBride's paper with which we disagree. The article is published as an 'Opinion', and we contend that this opinion is based on a partial and somewhat outdated appreciation of science and rapidly advancing technologies. We also contend that McBride has failed to appreciate the circumstances in which reflectance spectroscopy has merit. We elaborate on these aspects below.
2 | SOIL SPECTROSCOPY CAN BE 'TRUSTED'
McBride (2022) states that the spectroscopic method cannot be 'trusted' because of the 'indirect' relationship between the soil properties and the spectra. Using such a broad statement to describe soil spectroscopy is wrong. Diffuse reflectance spectroscopy is a well-established quantitative method long used in physical and analytical chemistry because atoms and molecules absorb radiation in specific wavelengths and have their own unique spectra (McClure, 2003; Pasquini, 2018; Workman Jr, 1996). Soil spectroscopy in the vis-NIR and MIR integrates the signals from the soil's minerals, organic matter, and water adsorbed or present in mineral structures (Clark et al., 1990; Nguyen et al., 1991; Viscarra Rossel & Hicks, 2015). The spectra provide a 'fingerprint' of the molecular composition of the soil matrix. Therefore, over the past four decades, research has shown that when soil's physical, chemical, and biological properties derive from or are associated with the mineral-organic matrix, the spectra of air-dry soil can respond to variation in those properties. If the spectroscopic model accounts for the minerals and organic signals in the spectra as well as their interactions, then the method can reasonably accurately estimate the concentration of other constituents (Ben-Dor & Banin, 1995; Soriano-Disla et al., 2014; Stenberg et al., 2010; Viscarra Rossel et al., 2006). Not all constituents are properties of the soil matrix, and those unrelated to it will contribute to the spectra only fortuitously. Any correlation to the spectra will be transient and, although perhaps locally significant, would not generally apply (e.g., Wetterlind et al., 2008).
Spectra obtained from soil under field conditions can contain substantial contributions from water. These are fairly linearly related to the water content (Bowers & Hanks, 1965; Lobell & Asner, 2002), providing an opportunity to develop valid spectroscopic calibrations for soil water. If water is taken into account (Minasny et al., 2011; Wijewardane et al., 2016), field spectra can also be calibrated to estimate soil constituents (e.g., Li et al., 2015; Viscarra Rossel et al., 2017). But this research is ongoing, so we leave a discussion of it for now.

3 | SPECTROSCOPY COMPLEMENTS CONVENTIONAL ANALYSIS
McBride (2022) writes that in many publications on soil spectroscopy, while describing the method's benefits, proponents of spectroscopy have implied that their ultimate goal is to replace conventional soil testing. We have seen the papers and can imagine that the authors, in their enthusiasm, have exaggerated the benefits of soil spectroscopy. Clearly, such claims are wrong because estimates of soil properties rely on empirical calibrations of the spectra with the properties measured conventionally. Nevertheless, we should consider those comments in their written context.
The question of whether spectroscopy could replace conventional analysis was posed some time ago by Janik et al. (1998), who asked if MIR spectroscopy could replace soil extractions. They answered that, for the most part, spectroscopy could not, but stated that spectroscopy adds value to conventional methods and extends them because spectroscopy also helps to understand soil chemistry better. Proponents of spectroscopy understand this and use it not to replace conventional analysis but to complement it. We need both to increase the data we can assemble to satisfy the growing demand for quantitative information on the soil to support new science and help meet the world's needs for food, fibre, climate adaptation, environmental quality, and sustainable development (Viscarra Rossel & Bouma, 2016).
McBride repeatedly opines that '… spectral reflectance methods are not sufficiently reliable to replace conventional testing'. As above, the aim is not to replace conventional testing. Nonetheless, the statement is misleading; diffuse reflectance spectroscopy is 'reliable' in the sense that repeated measurements are consistently reproducible on any given sample of soil from one occasion to another. We know from a great deal of experience that repeated spectroscopic measurements of a soil sample produce spectra with almost identical mineral-organic absorbances at the same wavelengths, provided, of course, that the soil sample does not change. Perhaps McBride does not mean that spectroscopy is unreliable but that estimates of soil properties are inaccurate because of uncertainties in the spectroscopic modelling. We address this point below.
One of the main criticisms made by McBride (2022) is that the proponents of spectroscopy have not shown it to be as accurate as conventional methods. Of course, they have not. The reason is that the conventional methods for soil analysis are the de facto standards. Any other techniques that aim to produce the same results are inevitably less accurate because they rely on the calibration of the spectra against the standards and incur statistical error. When developing the calibrations, we assume that the analytical errors of the standard methods are small enough to be negligible. In some circumstances they can be substantial, however (O'Rourke & Holden, 2011; van Leeuwen et al., 2022; Viscarra Rossel & Bouma, 2016), and then those errors will carry over into the calibrations and reduce the accuracy of the spectroscopic estimates.
It also does not make sense to compare a spectroscopic estimate with a standard analytical technique on a one-to-one basis. The advantage of soil spectroscopy (and incidentally of other sensing methods, whether proximal or more remote) is that one can make many more measurements, at least an order of magnitude more than by the conventional laboratory analysis, for the same cost. So, the estimation variance of a spectroscopic estimate of the average of some soil properties will be smaller than from conventional analysis. Thus, when the dominant source of error in the individual estimates is random rather than systematic (Guerrero & Lorenzetti, 2021), and when the costs and errors of calibration are taken into account, spectroscopy is cost-effective (Li et al., 2022). Moreover, because its estimation variance is smaller than that of means of fewer replicate measurements obtained by the more expensive standard analytical technique, the estimates are also more informative. The advantages become even more significant when one needs hundreds or thousands of data, for example, for assessments of soil spatial and temporal variation, for mapping (e.g., Ramirez-Lopez et al., 2019; Vågen et al., 2016), for modelling (e.g., Lee et al., 2021; Lee & Viscarra Rossel, 2020), and in decision-support systems (e.g., Vågen et al., 2018). For those reasons, the Australian Government Emission Reduction Fund (ERF) method for measuring and monitoring soil carbon sequestration (Endnote 1) is the first to include spectroscopy as a way to improve the accuracy and reduce the cost of measuring and monitoring (England & Viscarra Rossel, 2018; Viscarra Rossel, Lobsey, et al., 2017; Viscarra Rossel & Brus, 2018; Viscarra Rossel, Brus, et al., 2016).
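The variance argument above can be made concrete with a small calculation. The following is a minimal sketch, not a costing model: it assumes that measurement and prediction errors are random, unbiased and independent of the true variation in the field, and every number in it (budget, costs, variances, calibration set size) is hypothetical. Only the cost ratio of one-tenth is taken from the text.

```python
def variance_of_mean(n, field_var, error_var):
    """Variance of an estimated mean from n independent samples.

    Assumes purely random, unbiased errors (an assumption of this sketch);
    a systematic error would not shrink as n grows.
    """
    return (field_var + error_var) / n


def compare(budget, lab_cost, spec_cost, n_calibration,
            field_var, lab_error_var, spec_error_var):
    # Conventional route: the whole budget is spent on laboratory analyses.
    n_lab = int(budget // lab_cost)
    # Spectroscopic route: first pay for the calibration samples
    # (laboratory analysis plus spectrum), then buy spectra with the rest.
    remaining = budget - n_calibration * (lab_cost + spec_cost)
    n_spec = int(remaining // spec_cost)
    return {
        "n_lab": n_lab,
        "var_lab": variance_of_mean(n_lab, field_var, lab_error_var),
        "n_spec": n_spec,
        "var_spec": variance_of_mean(n_spec, field_var, spec_error_var),
    }


# Hypothetical values: a spectrum costs one-tenth of a laboratory analysis,
# and the spectroscopic error variance is nine times the laboratory one.
result = compare(budget=10000, lab_cost=50.0, spec_cost=5.0, n_calibration=50,
                 field_var=1.0, lab_error_var=0.01, spec_error_var=0.09)
for key, value in result.items():
    print(key, value)
```

Under these assumed values the spectroscopic route affords about seven times as many samples, so the variance of the estimated mean is smaller even though each individual estimate is less precise. The sketch ignores any error shared by all predictions from one calibration, which would behave as a systematic component.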
A notable exception to much of the discussion above is when the measurement of the soil property is made more directly from the characteristic absorptions of the soil constituent, for example, measures of soil colour or iron-oxide and clay mineralogy (Clark, 1999; Farmer, 1974; Viscarra Rossel, 2011). In such cases, the spectroscopic measurements might be at least as accurate as the conventional analysis (e.g., Janik et al., 1995).

4 | CALIBRATION, REGRESSION AND DIMENSIONALITY REDUCTION
Apart from the few exceptions mentioned, spectroscopic estimates of soil constituents depend on the strength of the relation between the target variables, such as the concentration of organic carbon, and the absorbances at the various wavelengths. The spectra produced by modern spectrometers contain thousands or more wavelengths. Therefore, the first task is a multivariate calibration, that is, obtaining an equation from which to predict the concentration of the target constituent from the spectra. Some transformation of the measurement scales is usually required, together with preprocessing to linearize the data and remove noise. One might think that, at its simplest, that equation might mean an ordinary least-squares multiple regression. However, we know that with so many predictor variables, multiple regression will suffer from a lack of selectivity of the predictors to the target and from redundancy and collinearity among the predictors (Martens & Naes, 1989).
McBride (2022) recognises the problems and quotes Bergstrom & West (2020), who write about the 'curse of dimensionality', first proposed by Bellman (1957). However, McBride dismisses the methods for reducing the number of dimensions used to overcome the problem. One popular way is to do a principal components analysis (PCA) of the spectra and then regress the target variable on some of the leading components. The method is called principal components regression (PCR) (Hotelling, 1957). Partial least squares regression (PLSR) (Wold et al., 1984) is better because it weights the predictors to minimise the prediction variances. McBride refers to these methods as 'data manipulations'. One might regard PCR as such because it makes no assumptions about the relations between the leading components and target variables. But to view PLSR in that way is a mistake; it is a sound form of statistical analysis that optimises predictions and reveals the relative importance of the predictor variables. PLSR is a vital tool in chemometrics (Wold et al., 2001). There are, of course, other ways of dealing with the many multicollinear predictors, but McBride fails to mention any of them. These, too, have been reported in the literature on soil spectroscopy. They include, for example, wavelet multi-resolution analysis (Viscarra Rossel, Behrens, et al., 2016; Viscarra Rossel, Brus, et al., 2016; Viscarra Rossel & Lark, 2009; Vohland et al., 2016) and the variance inflation factor (Song et al., 2021).
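To illustrate how PLSR copes with many collinear predictors, here is a minimal sketch of single-response PLS (PLS1) fitted with the NIPALS algorithm, written in plain Python for transparency. The 'spectra' are an assumption of this example: noise-free mixtures of two Gaussian absorption features over 50 bands, with 20 calibration samples, so ordinary least squares on the 50 bands would not be identifiable. The function names are illustrative; in practice one would use an established chemometrics library.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pls1_fit(X, y, n_components):
    """Fit PLS1 by NIPALS. X is a list of spectra (rows), y a list of targets."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # centred spectra
    f = [v - y_mean for v in y]                                # centred target
    components = []
    for _ in range(n_components):
        # Weights: covariance of each band with the current target residual.
        w = [dot([E[i][j] for i in range(n)], f) for j in range(p)]
        norm = math.sqrt(dot(w, w))
        w = [v / norm for v in w]
        t = [dot(row, w) for row in E]                         # scores
        tt = dot(t, t)
        load = [dot([E[i][j] for i in range(n)], t) / tt for j in range(p)]
        q = dot(f, t) / tt                                     # target loading
        # Deflate: remove what this component explains from spectra and target.
        E = [[E[i][j] - t[i] * load[j] for j in range(p)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        components.append((w, load, q))
    return x_mean, y_mean, components


def pls1_predict(model, x):
    """Predict one sample by repeating the score/deflation steps."""
    x_mean, y_mean, components = model
    e = [a - m for a, m in zip(x, x_mean)]
    y_hat = y_mean
    for w, load, q in components:
        t = dot(e, w)
        y_hat += q * t
        e = [a - t * b for a, b in zip(e, load)]
    return y_hat


def simulate(n, rng):
    """Hypothetical noise-free spectra driven by two latent constituents."""
    f1 = [math.exp(-((j - 15) / 6.0) ** 2) for j in range(50)]
    f2 = [math.exp(-((j - 32) / 8.0) ** 2) for j in range(50)]
    X, y = [], []
    for _ in range(n):
        a, b = rng.random(), rng.random()
        X.append([a * u + b * v for u, v in zip(f1, f2)])
        y.append(2.0 * a + 0.5 * b)
    return X, y


rng = random.Random(0)
X_cal, y_cal = simulate(20, rng)   # calibration set: 20 samples, 50 bands
X_val, y_val = simulate(10, rng)   # independent validation set
model = pls1_fit(X_cal, y_cal, n_components=2)
errors = [pls1_predict(model, x) - obs for x, obs in zip(X_val, y_val)]
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
print(f"validation RMSE with 2 components: {rmse:.2e}")
```

Because the simulated spectra have only two underlying sources of variation and no noise, two components recover the target essentially exactly. Real spectra are noisy and need more components, with the number chosen by validation rather than fixed in advance.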

4.1 | Extrapolations and overfitting
McBride (2022) criticises extrapolation beyond the limits of the data on which the models are built and over-fitting, that is, fitting a model to virtually meaningless information, effectively to noise. They are fair comments, but they are practices that statisticians have warned against for many years; they are not confined to soil spectroscopy.
Unfortunately, included in the burgeoning publications on spectroscopic modelling are many examples of poor practice, of both unwise extrapolation and over-fitting. They seem to arise from a poor understanding of spectroscopy itself, naïve implementation of mathematical and statistical procedures, poor validation practice, and access to computer software on which one needs only to 'press a few buttons'. As McBride shows, these articles are reasonably easy to identify. Unfortunately, McBride fails to recognise the opposite: the many articles that record sound practice. When the modelling is robust (Hastie et al., 2009; Srivastava et al., 2014), and the spectroscopic models are interrogated, interpreted and validated, there is little risk of extrapolation or over-fitting.

5 | MACHINE LEARNING
The increased capacity and speed of computers in the last 20 years have enabled mathematicians and data scientists to develop algorithms for machine learning. McBride (2022) is critical of these and calls them 'black boxes'. There are indeed examples that report on the spectroscopic modelling with machine learning that lack adequate implementation and interpretation, undermining the confidence in those models. If one only read those reports, there might be some justification for labelling all machine learning as obscure, uninterpretable 'black boxes'.
We agree that machine learning should be used with caution and judged on its performance. We should ask the following questions before using machine learning for spectroscopic modelling. Is it necessary? Even if not, can we use it to our gain? Or is it a 'sledgehammer to crack a nut' with no advantages over more straightforward and transparent regression methods? Whatever we answer, we must recognise that machine learning is here to stay (Heuvelink & Webster, 2022). Further, it will become ever more powerful and potentially more helpful as time passes. If we are to use machine learning for estimating soil constituents with spectra, then we must validate the outcomes with adequately sampled, independent data (Brown et al., 2005;Spiegelhalter, 2019). Further, if we are to understand the outcomes, we must learn how the input variables contribute meaningfully to the predictions. McBride itemises the steps we should take but still lacks confidence in the procedures.
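Validation on independent data is usually summarised with a few error statistics. The sketch below shows one common choice, assumed here rather than prescribed by the text: the mean error (bias), the root mean squared error (RMSE) and the standard deviation of the errors (SDE), which separate systematic from random inaccuracy. The observed and predicted values are hypothetical.

```python
import math


def validation_statistics(observed, predicted):
    """Summarise prediction errors on an independent validation set."""
    errors = [p - o for o, p in zip(observed, predicted)]
    n = len(errors)
    mean_error = sum(errors) / n                       # bias (systematic part)
    rmse = math.sqrt(sum(e * e for e in errors) / n)   # overall inaccuracy
    # Spread of the errors about their mean (random part). With divisor n,
    # rmse**2 equals mean_error**2 + sde**2.
    sde = math.sqrt(sum((e - mean_error) ** 2 for e in errors) / n)
    return {"ME": mean_error, "RMSE": rmse, "SDE": sde}


# Hypothetical organic carbon concentrations (%), for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.2, 2.1, 3.4, 4.1]
stats = validation_statistics(observed, predicted)
print(stats)
```

The statistics are meaningful only if the validation samples were not used in any step of the model building, including preprocessing choices and tuning.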
The problem of modelling with highly dimensional data does not affect all machine learning methods, or at least not to the same extent. For example, methods based on regression trees tend to minimise the vector space for predictions and thus prevent the 'curse of dimensionality'. Random forests (Breiman, 2001) create many trees, but only a few variables bear on the target within any one tree. Deep convolutional neural networks (CNN), a more recent innovation, are proving effective for analysing intricate structures in high-dimensional data (LeCun et al., 2015). They can effectively separate the signal from the noise in the highly-dimensional, multi-collinear soil spectra (Yang et al., 2020).
Before we conclude this section, we draw attention to the interpretability of machine learning. Interpretability varies depending on the algorithm and how one uses it (Viscarra Rossel & Behrens, 2010). Spectroscopic modelling with regression trees is explainable and interpretable. For example, CUBIST (Quinlan, 1992), a now popular (and well-performing) method for spectroscopic modelling (Viscarra Rossel & Webster, 2012), uses human-readable if-then conditions to split the spectra into coherent 'branches' and fits (well-understood) piecewise multiple linear regressions to the outcomes of each split. We agree that other methods, like support vector machines (SVM) and deep CNNs, can be difficult to interpret because of the complex data transformations or their huge and complex network of parameters. However, there is rapidly advancing research on explainable artificial intelligence (XAI) (Gunning et al., 2019;Savage, 2022).
These developments, based on sound theoretical foundations, are transforming the 'black-boxes' into 'grey-' and even 'white-boxes' (e.g., Chen et al., 2020). The methods are finding applications in soil science and soil spectroscopy. For example, SHapley Additive exPlanations (SHAP) (Lundberg & Lee, 2017), based on game theory (Roth, 1988), has been used successfully for meaningful interpretation in spectroscopic machine learning (Haghi et al., 2021;Zhong et al., 2021).
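The game-theoretic idea behind SHAP can be shown with exact Shapley values for a very small model. This is a minimal sketch of the underlying definition, not of the SHAP library: it enumerates every coalition of features, which is feasible only for a handful of predictors, and it represents an 'absent' feature by a baseline value, which is one convention among several. The model and its three 'band' inputs are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one prediction, by coalition enumeration."""
    m = len(x)

    def value(subset):
        # Features in the coalition take their observed value, others the baseline.
        z = [x[j] if j in subset else baseline[j] for j in range(m)]
        return predict(z)

    phi = []
    for i in range(m):
        others = [j for j in range(m) if j != i]
        total = 0.0
        for size in range(m):
            weight = factorial(size) * factorial(m - size - 1) / factorial(m)
            for coalition in combinations(others, size):
                members = set(coalition)
                # Marginal contribution of feature i to this coalition.
                total += weight * (value(members | {i}) - value(members))
        phi.append(total)
    return phi


def toy_model(z):
    """Hypothetical model with an interaction between the first two bands."""
    return 1.0 + 2.0 * z[0] + 0.5 * z[1] + 3.0 * z[0] * z[1] - 1.0 * z[2]


x = [0.4, 0.2, 0.1]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The contributions add up to the difference between the prediction for the sample and the prediction at the baseline, and the interaction term is shared equally between the two bands involved. That additivity is what makes the attributions readable alongside the spectrum.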

6 | SPECTRAL LIBRARIES AND MODEL LOCALIZATION
For decades, soil scientists have been developing databases (or libraries) of soil properties with corresponding spectra for regions of various sizes, from strictly local to national (e.g., Viscarra Rossel & Webster, 2012; Wijewardane et al., 2018), continental (e.g., Stevens et al., 2013), and global (Shepherd & Walsh, 2002;Viscarra Rossel, Behrens, et al., 2016). Many of these libraries were developed by measuring the spectra of soil samples stored in archives as legacies from different experiments and surveys at different scales. The analytical reference data in those libraries often require significant analysis, preprocessing and harmonisation to ensure that they are consistent and of good quality for modelling. Since the soil samples are not collected using a sampling design suitable for developing a spectral library and subsequent modelling, one must also judiciously validate the spectroscopic models to prevent overoptimistic results.
In earlier research (e.g., Brown et al. (2006) cited by McBride (2022)), the main reason for developing spectral libraries was to see if one could use them to build unique calibrations to estimate soil properties. The approach was taken from spectroscopic studies in other domains, developed for estimating the concentrations of substances that are much less complex than soil. Soil scientists embraced the approach directly and without alteration, often resulting in suboptimal spectroscopic models of soil properties and a prejudiced perception of its potential.
Since then, research has shown that soil spectral libraries should be used differently, not to make large general models but as a source of information for building localised calibrations for specific contexts and pedologic domains. We have learned that one must be wary of using a general calibration derived from all of the data in a country-wide national or global library for estimation locally in a small region (Guerrero et al., 2016). Unlike the estimates from calibrations derived locally, these general models are likely to produce biased estimates because the general relationships will not depict those present locally and are likely also to mask local variations. Thus, even if we had spectral libraries with an infinite number of samples and derived calibrations with all the data, this problem would remain.
Current understanding and ongoing research suggest that large and diverse spectral libraries are beneficial, but the development of methods for 'localising' the spectroscopic modelling is equally important. The literature contains reports on several of these methods. For instance, they range from practical methods such as spiking (Guerrero et al., 2010) and spiking with extra-weighting (Guerrero et al., 2014), the LOCAL algorithm (Shenk & Westerhaus, 1991), memory-based learning (MBL) methods that focus on a deterministic local search of the spectral library (Ramirez-Lopez et al., 2013), and data-driven stochastic search methods such as RS-LOCAL (Shen et al., 2022), to feature-based deep transfer learning (Liu et al., 2018). McBride seems to have missed all of this research. Most recently, Shen et al. (2022) showed that instance-based deep transfer learning lessens the need for conventional analytical measurements. They suggest that as these methods and spectral libraries develop, the need for conventional analysis will further diminish and eventually disappear without losing the accuracy that entirely local modelling would provide. Research is ongoing as those ideas need further experimentation and testing, however.
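The difference between a general calibration and a localised one can be sketched with a toy library. This is a much simplified stand-in for memory-based learning, assuming a two-band 'spectrum', Euclidean distance for the neighbour search and a straight-line local model where published methods typically use local PLSR or similar. The library, its two soil domains and all values are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


def local_predict(library, new_spectrum, k):
    """Calibrate on the k library samples most similar to the new spectrum."""
    neighbours = sorted(library, key=lambda rec: distance(rec[0], new_spectrum))[:k]
    a, b = fit_line([s[1] for s, _ in neighbours], [y for _, y in neighbours])
    return a + b * new_spectrum[1]


def build_library():
    """Two hypothetical soil domains with different spectrum-property relations.

    Band 0 separates the domains (say, a mineral feature); band 1 relates to
    the target property with a domain-specific intercept and slope.
    """
    library = []
    for i in range(10):
        band1 = 0.1 * i
        library.append(((0.1, band1), 1.0 + 2.0 * band1))  # domain A
        library.append(((0.9, band1), 3.0 + 0.5 * band1))  # domain B
    return library


library = build_library()
new_spectrum = (0.1, 0.55)             # a new sample from domain A
true_value = 1.0 + 2.0 * 0.55

# General calibration: one line fitted to the whole library.
a_all, b_all = fit_line([s[1] for s, _ in library], [y for _, y in library])
global_pred = a_all + b_all * new_spectrum[1]
local_pred = local_predict(library, new_spectrum, k=5)

print("true:", true_value, "general:", global_pred, "local:", local_pred)
```

In this constructed case the general model averages the two domains and is biased for the new sample, whereas the local model draws only on spectrally similar samples and is not. Adding more samples to the library would not remove the bias of the general model, which is the point made in the text.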

7 | CONCLUSIONS
The recent article in this journal by McBride (2022), under the heading 'Opinion', criticised reflectance spectroscopy for estimating the concentrations of soil constituents. Some of that criticism is fair; many exponents have exaggerated claims about the technology. Other aspects of McBride's Opinion are outdated, incorrect or otherwise misleading. We countered McBride's views to provide readers with more balanced insight into soil spectroscopy and its merits. In responding to McBride's article, we have tried to distinguish between fair comments and what we regard as outdated statements of fact and false inferences. We have concentrated on what we regard as the most important matters and we have omitted issues that are discussed elsewhere, for example, the spectroscopic estimation of trace elements in soil (Baveye & Laba, 2015;Shi et al., 2015).
Like most new methods and technologies, soil spectroscopy has had its hiccups, and some practitioners have undoubtedly been over-enthusiastic in its application. However, as the technology develops and interest in soil spectroscopy grows further, practitioners understand the various methods better in practice and research and are becoming more adept at applying them. As a result, soil spectroscopy is becoming an essential tool for measuring and monitoring soil and obtaining large amounts of quantitative soil information in a broad range of soil and environmental science applications. Most practitioners of soil spectroscopy do not seek to replace conventional analytical methods; instead, they see soil spectroscopy working alongside conventional analyses as an essential partner. Other scientists have recognised this too. The Global Soil Partnership of the Food and Agriculture Organization (FAO) of the United Nations and its Global Soil Laboratory Network initiative on soil spectroscopy (GLOSOLAN-Spec) is helping to combine conventional soil testing and spectroscopy. It is also helping to bring the community together via a scientific community of practice to steer future developments and build capacity in soil spectroscopy globally, while recognising that individual countries will have their particular approaches.
Soil is complex, and so are the relationships between its components and the spectra. Sophisticated technologies and mathematical and statistical methods are needed to characterise and deal with such complexity. However, just because we do not understand them entirely does not mean we should shy away from them. On the contrary, we should embrace them and strive to comprehend and use them to gain insights into the complexity of soil. We hope that our article gives readers and newcomers to soil spectroscopy a more objective, balanced and optimistic view of the subject and its future.

ENDNOTE
1 http://www.cleanenergyregulator.gov.au/ERF/Choosing-aproject-type/Opportunities-for-the-land-sector/Agriculturalmethods/estimating-soil-organic-carbon-sequestration-usingmeasurement-and-models-method