A simple and nondestructive approach for the analysis of soluble solid content in citrus by using portable visible to near‐infrared spectroscopy

Abstract A simple and nondestructive method for the analysis of soluble solid content in citrus was established using portable visible to near‐infrared spectroscopy (Vis/NIRS) in reflectance mode in combination with appropriate chemometric methods. The spectra were obtained directly with the portable Vis/NIRS instrument without destroying the samples. Outlier detection was performed using leave‐one‐out cross‐validation (LOOCV) with the 3σ criterion, and the calibration models were established with the partial least squares (PLS) algorithm. Different data pretreatment methods were applied before calibration to eliminate noise and background interference and to determine which one gives the best model accuracy. However, the correlation coefficients were all <0.62, and the results of all pretreatments remained unsatisfactory. Variable selection methods were therefore examined to improve the accuracy, and the variable adaptive boosting partial least squares (VABPLS) method was used to obtain more robust models. The results show that standard normal variate (SNV) transformation is the best pretreatment method, while VABPLS can significantly simplify the calculation and improve the result even without pretreatment. The correlation coefficient of the best prediction model is 0.82, compared with 0.48 for the raw data. This performance shows the feasibility of portable Vis/NIRS technology in combination with appropriate chemometric methods for the determination of citrus soluble solid content.


KEYWORDS
chemometric method, citrus soluble solid content, nondestructive, portable visible to near-infrared spectroscopy, variable selection

characteristic index to evaluate the internal quality of citrus fruits.
However, chemical titration methods demand time-consuming sample preparation, while chromatographic methods generally demand expensive equipment and solvent elution. Sugar content, represented by soluble solid content, is the most important quality index used by the citrus industry to determine marketing standards (Jin Lee et al., 2004; Marrubini et al., 2016; Švecová et al., 2015).
The saccharimeter, a soluble solid content analyzer that measures the refractive index or polarization rotation angle of optically active sugars, is widely used in the fruit and wine processing industries (Jin Lee et al., 2004). The method, however, still requires sample destruction and is time-consuming. Low-cost, nondestructive, and accurate analysis of soluble solid content has become a new trend in citrus production.
Visible to near-infrared spectroscopy (Vis/NIRS) is a simple, fast, and nondestructive analytical technique that is widely used in the analysis of complex samples in the food (Towett et al., 2013; Zhu, Chen, Wu, Xing, & Yuan, 2018), agriculture (Tardaguila, Fernández-Novales, Gutiérrez, & Paz Diago, 2017), and medicine industries (Li, Du, Cai, & Shao, 2012). In recent years, Vis/NIRS instruments have trended toward miniaturization and low manufacturing cost, and various portable Vis/NIRS instruments have been developed (Cirilli et al., 2016). However, because of this miniaturization, portable Vis/NIRS instruments have many deficiencies in spectral resolution, scanning range, sensitivity, long-term stability, reliability, accuracy, and instrument standardization.
In addition, because of the low sensitivity and the complexity of the samples, the useful information is usually carried by broad spectral peaks. To solve these problems, a large number of chemometric methods have been developed. Partial least squares (PLS) regression and related robust techniques are the most commonly used methods for establishing quantitative models (De Luca et al., 2019; Li, Shao, & Cai, 2007; Sampaio et al., 2018). Furthermore, a large number of spectral pretreatment methods for baseline correction and background removal have been developed, each with its own advantages and drawbacks (Bian, Li, Shao, & Liu, 2016; Shao, Bian, & Cai, 2010). Choosing the proper pretreatment method is very important, because it can improve the accuracy of the quantitative analysis model to a certain extent. Poor models may also be obtained when the spectra contain nonmodeled information. To solve this problem, variable selection methods such as Monte Carlo uninformative variable elimination (MC-UVE) (Cai, Li, & Shao, 2008), randomization test (RT) (Xu, Liu, Cai, & Shao, 2009), competitive adaptive reweighted sampling (CARS) (Li, Liang, Xu, & Cao, 2009), and related techniques (Han, Tan, et al., 2017) were proposed for building robust and accurate models. In our previous work, variable adaptive boosting partial least squares (VABPLS) (Li, Du, Ma, Zhou, & Jiang, 2018) was proposed to obtain robust models and improve the prediction ability by simultaneously weighting samples and variables in the boosting step.
Vis/NIRS has attracted increasing attention in the analysis of soluble solid content in citrus because it is fast and nondestructive (Cavaco et al., 2018; Cayuela & Weiland, 2010; Cen, He, & Huang, 2007). The prediction of soluble solid content in citrus by Vis/NIRS and sensory tests was investigated, and the results show that the nondestructive method can meet the sensory requirements of consumers (Yuan et al., 2013). The sample temperature affects the spectrum in a nonlinear way. To solve this problem, a global temperature calibration model for Fourier transform near-infrared reflectance (FT-NIR) spectroscopy was developed and has been used successfully to measure soluble solid content in citrus (Lu et al., 2006). However, the citrus peel strongly interferes with the spectra. In addition, most portable Vis/NIRS instruments are of the grating scanning type, which differs from FT-NIR spectroscopy and its good detection results. Noise, background, and nonmodeled information interference are unavoidable in portable Vis/NIRS signals. At present, there is little research on the application of portable Vis/NIRS to the analysis of citrus soluble solid content.
The aim of this study is to establish appropriate chemometric methods for portable Vis/NIRS instruments to obtain reliable and accurate determinations of citrus soluble solid content. Different pretreatment methods were analyzed, and variable selection methods and the VABPLS method were investigated to obtain more robust models. The correlation coefficient of cross-validation (RCV) and root mean square error of cross-validation (RMSECV) were applied to evaluate the performance of the final models, while the correlation coefficient (R) and root mean square error of prediction (RMSEP) were used to evaluate the methods. Furthermore, the selected characteristic wavelengths are discussed in detail. Based on portable Vis/NIRS and chemometric methods, the technology can be regarded as a simple, low-cost, nondestructive, and accurate way to analyze citrus soluble solid content, and it can be applied in future fruit production.
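The two prediction metrics named above can be sketched as follows. This is a minimal illustration using the usual definitions of RMSEP and the Pearson correlation coefficient, which are assumed here because the text does not spell them out. The function names and the sample values are hypothetical.

```python
import math

def rmsep(y_true, y_pred):
    """Root mean square error between reference and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def correlation(y_true, y_pred):
    """Pearson correlation coefficient R between reference and predicted values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    ss_t = sum((t - mean_t) ** 2 for t in y_true)
    ss_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(ss_t * ss_p)

# Hypothetical reference and predicted soluble solid contents (°Brix)
reference = [9.0, 10.5, 11.2, 12.0, 8.4]
predicted = [9.3, 10.1, 11.0, 12.4, 8.9]
print("RMSEP:", rmsep(reference, predicted))
print("R:", correlation(reference, predicted))
```

The same two functions give RMSECV and RCV when `y_pred` holds cross-validated predictions for the calibration samples instead of test-set predictions.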

| Citrus sample
In this study, 105 Citrus sinensis (L.) Osbeck samples of uniform color (orange), shape, and size (~60 mm diameter) were randomly purchased from local supermarkets between November and December. To reduce the effect of sample temperature on the prediction accuracy, the samples were placed at room temperature for 24 hr for equilibration.
Then, the samples were cleaned and numbered before measurement.

| Citrus soluble solid content determination
Soluble solid contents were measured in the juice squeezed from the samples with a digital refractometer (saccharimeter; model PR-101, Atago Co. Ltd.), and the values were provided by Beijing Weichuangyingtu Technology Co., Ltd. (Jin Lee et al., 2004). Each value is the average of three parallel measurements.

| Instrumentation and measurements
Vis/NIRS spectra were obtained with a NIRMagic 1100 spectrometer (Beijing Weichuangyingtu Technology Co., Ltd) that combines a standard multichannel grating detector in diffuse reflectance mode with an integrating sphere diffuse reflection accessory. The light source was rated at 12 V/20 W, and the integration time and averaging time were 40 ms and 2 s, respectively. White and dark references were collected for each acquisition. The citrus was placed directly in the center of the light spot. The spectra were collected at the equator of the fruit, and the average of four equatorial locations was used. Each spectrum is composed of 501 data points recorded from 600 to 1,100 nm and is averaged from three parallel measurements.
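The text does not state how the white and dark references enter the calculation. A common convention, assumed here, is to compute the relative reflectance as (sample − dark)/(white − dark) at each wavelength, optionally convert it to apparent absorbance log10(1/R), and average the repeated scans point by point. The sketch below illustrates that convention; all function names and numbers are hypothetical and do not describe the instrument's own software.

```python
import math

def reflectance(sample, white, dark):
    """Relative reflectance per wavelength: (sample - dark) / (white - dark)."""
    return [(s - d) / (w - d) for s, w, d in zip(sample, white, dark)]

def absorbance(refl):
    """Apparent absorbance log10(1/R) per wavelength."""
    return [math.log10(1.0 / r) for r in refl]

def mean_spectrum(spectra):
    """Point-by-point average of several spectra (e.g., four equatorial positions)."""
    n = len(spectra)
    return [sum(column) / n for column in zip(*spectra)]

# Hypothetical raw detector counts at two wavelengths
sample = [600.0, 350.0]
white = [1100.0, 1100.0]
dark = [100.0, 100.0]

refl = reflectance(sample, white, dark)   # [0.5, 0.25]
print(refl, absorbance(refl))
print(mean_spectrum([[0.50, 0.25], [0.54, 0.27], [0.46, 0.23], [0.50, 0.25]]))
```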

| Data analysis
Outlier detection was performed using leave-one-out cross-validation (LOOCV) with the 3σ criterion. The Kennard-Stone (KS) method was used to partition the samples into calibration and test sets, and the calibration models were established with the PLS algorithm. The performance of the developed models was evaluated in terms of RCV and RMSECV, while the prediction performance was evaluated by R and RMSEP. To some extent, this evaluation also indicates the robustness of the models. Monte Carlo cross-validation (MCCV) with the adjusted Wold's R criterion was used to determine the number of latent variables (LV). To eliminate noise and background interference, the spectra were treated with different pretreatment methods, such as bias correction, detrend, standard normal variate (SNV) transformation, maximum and minimum normalization, multiplicative scatter correction (MSC), first-order derivative (1st) and second-order derivative (2nd), continuous wavelet transform (CWT), and their combinations, to obtain reliable quantitative calibration models. Generally, a spectrum contains several hundred or even thousands of variables (wavelengths). Some of the variables may contribute more collinearity and noise than relevant information to the models (Li et al., 2009; Xu et al., 2009). Poor models may be obtained when the spectra contain nonmodeled information.
Therefore, variable selection methods such as MC-UVE and CARS were used to build parsimonious and accurate models. The former method builds a large number of models with randomly selected calibration samples and then evaluates each variable (wavelength) with a stability parameter. The larger the stability, the more significant the variable (Cai et al., 2008). The latter method mimics the "survival of the fittest" principle that underlies Darwin's theory of evolution and has been successfully adopted to select key wavelengths (Li et al., 2009). In addition, VABPLS was applied to obtain more robust models and enhance the prediction ability by simultaneously weighting samples and variables in the boosting step (Li et al., 2018). Furthermore, consensus partial least squares regression (cPLS) and boosting PLS with the same training set and prediction set were used for comparison.
The programs were run in Matlab 8.3 (The Mathworks) on a personal computer. Figure 1 shows the original spectra of the citrus dataset and the distribution of soluble solid contents. Obvious peaks can be seen around 680 and 750 nm in the spectra. Obvious noise interference exists in the range above 950 nm. Therefore, the spectra in the range of 600 to 950 nm were selected for further calculation, which is consistent with a previous report (Jin Lee et al., 2004). Baseline drift is also visible in the original spectra. It is not feasible to  Information about orange color is found in the 650-700 nm range of the visible spectrum. A continuous increase in absorbance was observed from 710 to 990 nm. The peaks around 760 and 970 nm are normally attributed to water or OH groups. Furthermore, Figure 1b shows that the soluble solid content of all samples ranged from 7.5 to 13 °Brix. Figure 2 shows the B coefficients and the variable importance in the projection (VIP) values. Obvious noise interference can again be seen above 950 nm, and thus, the range of 600 to 950 nm was selected for further calculation.
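Of the pretreatments listed above, SNV is simple enough to illustrate in a few lines. The sketch below uses the standard definition: each spectrum is centered on its own mean and divided by its own standard deviation. The use of the sample standard deviation (n − 1) and the example values are assumptions for illustration, not details taken from the authors' Matlab programs.

```python
import math

def snv(spectrum):
    """Standard normal variate: center and scale one spectrum by its own mean and SD."""
    n = len(spectrum)
    mean = sum(spectrum) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in spectrum) / (n - 1))
    return [(x - mean) / sd for x in spectrum]

# Two hypothetical spectra that differ only by a multiplicative and an additive
# scatter effect (second = 2 * first + 1); SNV maps them to the same curve.
first = [0.2, 0.4, 0.6]
second = [1.4, 1.8, 2.2]
print(snv(first))
print(snv(second))
```

Because SNV works on each spectrum independently, it needs no reference spectrum, unlike MSC, which fits every spectrum against a common reference such as the mean spectrum.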
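The stability parameter used by MC-UVE can be sketched as follows, assuming the usual definition: for each variable, the mean of its regression coefficient over the Monte Carlo models divided by the standard deviation of that coefficient. The PLS fitting on random calibration subsets is not shown; the coefficient matrix below is hypothetical, and keeping a fixed number of variables (`n_keep`) is an illustrative choice, since the text does not state how the cutoff was set.

```python
import statistics

def stability(coefficients):
    """Stability per variable: mean / standard deviation of its coefficient over runs.

    coefficients: list of runs, each a list with one regression coefficient per variable.
    """
    return [statistics.mean(column) / statistics.stdev(column)
            for column in zip(*coefficients)]

def select_variables(coefficients, n_keep):
    """Indices of the n_keep variables with the largest absolute stability."""
    s = stability(coefficients)
    ranked = sorted(range(len(s)), key=lambda j: abs(s[j]), reverse=True)
    return sorted(ranked[:n_keep])

# Hypothetical coefficients from three Monte Carlo PLS models (rows) for three variables
runs = [
    [1.0, 0.5, -2.0],
    [1.1, -0.4, -2.1],
    [0.9, 0.1, -1.9],
]
print(stability(runs))
print(select_variables(runs, 2))  # [0, 2]: the middle variable changes sign between runs
```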

| Correlation between spectra and citrus soluble solid contents
High regression coefficients are found in the wavelength ranges of 640-700 and 890-940 nm, indicating the significance of these wavelengths.

| Outlier detection
Outliers may be caused by instrument instability and operational errors, and they may reduce the quality of the model. In this paper, outlier detection was performed using LOOCV with the 3σ criterion. Figure 3 shows a plot of the prediction errors and the 3σ criterion. The prediction error of sample no. 73 clearly exceeded the threshold, so this sample was considered an outlier.
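The 3σ rule applied to the LOOCV prediction errors can be sketched as below. The errors are assumed to have been computed already (one per sample, each from a model built without that sample), and measuring the deviation from the mean error is an assumption, since the text does not say whether the threshold is centered. The error values are hypothetical.

```python
import statistics

def three_sigma_outliers(errors):
    """Indices of samples whose prediction error lies more than 3 SD from the mean error."""
    mean = statistics.mean(errors)
    sd = statistics.stdev(errors)
    return [i for i, e in enumerate(errors) if abs(e - mean) > 3 * sd]

# Hypothetical LOOCV prediction errors (°Brix) for 20 samples; the last one is aberrant
loocv_errors = [0.1, -0.2, 0.15, -0.1, 0.05, -0.05, 0.2, -0.15, 0.1, -0.1,
                0.0, 0.12, -0.08, 0.07, -0.12, 0.03, -0.03, 0.18, -0.18, 2.5]
print(three_sigma_outliers(loocv_errors))  # [19]
```

Note that a single aberrant sample inflates the standard deviation itself, so the rule can only flag a point when the dataset is reasonably large; with the 105 samples used here this is not a limitation.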

MSC, 1st, 2nd, CWT, and their combinations to obtain reliable quantitative calibration models. Table 1 compares the LV, RMSECV, RCV, RMSEP, and R obtained with the thirteen pretreatment methods. Compared with the results for the raw spectra in the range of 600-1100 nm, the pretreatment methods did not significantly improve the results. The optimal LV number is more than 10 with the 1st-DT method. The R values are all less than 0.62, and the results of all pretreatments are still unsatisfactory.
Acceptable results cannot be obtained by models built directly with the full spectra. This poor result may have many causes. One crucial cause is the noise interference in the range above 950 nm. The LV, RMSECV, RCV, RMSEP, and R obtained with the spectra in the range of 600 to 950 nm are also compared in Table 1. The optimal LV numbers are all less than 10, which makes these models more reliable than those built with the full spectra. The results of the pretreatment methods are slightly better than those of the raw spectra except for the 2nd method, and the combina-

| Variable selection
Variable selection can be used to further optimize Vis/NIR quantitative analysis models. In this study, the variable selection methods MC-UVE and CARS were used to improve the accuracy.
In addition, the VABPLS method was used to obtain more robust models and enhance the prediction ability. Furthermore, the results of cPLS and boosting PLS with the same training set and prediction set were obtained for comparison. A total of 100 independent runs were performed, and the means of the numbers of variables, RMSEPs, and Rs were obtained. The performance of the final models was evaluated according to the RMSEP and R for the test set.
The PLS models developed with MC-UVE are shown in Table 2.
Compared with the raw spectra-PLS model, the number of variables decreased from 350 to 91 with the CWT-SNV method. SNV is the best pretreatment method, with an R value as high as 0.80. Figure 4a shows the variable distribution with the MC-UVE and SNV methods.
Variable selection can not only simplify the model but also extract the wavelengths related to the components. Therefore, the wavelengths that are less affected by interference from the orange peel can be obtained. As a consequence, eight main wavelength bands were retained. They  (Xu, Qi, Sun, Fu, & Ying, 2012). However, the numbers of variables for the raw spectra or the spectra treated with the De Bias, Min Max, 1st, 2nd, and CWT methods are as high as 340, and the variable selection was rather unsatisfactory. As a result, the RMSEPs and Rs of the raw spectra with the MC-UVE method are nearly the same as the results without variable selection.
The PLS models developed with CARS are shown in Table 2.
Compared with the raw spectra-PLS model, the number of variables with the CARS method decreased from 350 to 81. The results of the raw spectra with the CARS method are better than those without variable selection. The results of most pretreatment methods are better than those of the raw spectra, and the combinations of pretreatment methods can also improve the RMSECV values. SNV is still the best pretreatment method, with an R value as high as 0.821. Figure 4b shows the variable distribution with the CARS and SNV methods.

FIGURE 4 Variable distribution of MC-UVE and the SNV methods (a), variable distribution of CARS and SNV methods (b), and variable distribution of VABPLS and SNV methods (c)

| CONCLUSION
A simple and nondestructive method for the direct analysis of soluble solid content in citrus was established using portable Vis/NIRS in combination with appropriate chemometric methods. Data pretreatment methods can be used to eliminate noise and background interference, while variable selection significantly improves the accuracy.
SNV transformation is the best pretreatment method. The VABPLS method can significantly simplify the calculation and improve the results. The technology developed here, based on portable Vis/NIRS and chemometric methods, can be regarded as a simple, low-cost, nondestructive, and accurate way to analyze citrus soluble solid content and can be widely applied in future production. The analysis of different citrus varieties will be considered in the future.

ACKNOWLEDGMENTS
This study was supported by National Natural Science Foundation

CONFLICT OF INTEREST
The authors declare that they have no conflicts of interest regarding this work.

ETHICAL APPROVAL
This study does not involve any human or animal testing.