A novel method for the nondestructive classification of different‐age Citri Reticulatae Pericarpium based on data combination technique

Abstract The quality of Citri Reticulatae Pericarpium (CRP) is closely correlated with the aging time. However, CRPs in different storage ages are similar in appearance, and the young CRP may be labeled as the aged one to obtain the excess profit by some unscrupulous traders. Most traditional analysis methods are laborious and time‐consuming, and they can hardly realize the nondestructive classification. In this paper, a novel method based on near‐infrared diffuse reflectance spectroscopy (NIRDRS) and data combination technique for the nondestructive classification of different‐age CRPs was proposed. The CRPs in different storage ages (5, 10, 15, 20, and 25 years) were measured. The near‐infrared spectra of outer skin and inner capsule were obtained. Principal component analysis (PCA), soft independent modeling of class analogy (SIMCA), and Fisher's linear discriminant analysis (FLD), with different data pretreatment methods, were used for the classification analysis. Data combination of the outer skin and inner capsule spectra was discussed for further improving the classification results. The results show that multiple sensors provide more useful and complementary information than a single sensor does for improving the prediction accuracy. With the help of data combination strategy, 100% prediction accuracy can be obtained with both second‐order derivative–FLD and continuous wavelet transform–multiplicative scatter correction–FLD methods.

CRPs in different storage ages are similar in appearance, and it is difficult to distinguish them for the layman. The young CRP may be labeled as the aged one to obtain the excess profit by some unscrupulous traders. Therefore, it is very important to develop a simple, rapid, and accurate way for the classification of CRPs in different storage ages.
The thickness of peel, the size of secretory cavity, smell, and taste are used as indicators to distinguish CRPs in different storage ages. However, this is difficult for consumers and food inspectors because CRPs in different storage ages are similar in physical appearance, smell, and taste. Instrumental analysis is an effective approach to identification, based on the volatile compounds, flavonoids, alkaloids, and phenolic acids in CRPs. Gas chromatography-mass spectrometry (GC-MS) has been used to compare comprehensively the volatile constituents in Citri Reticulatae Blanco Pericarpium (CRBP) and Citri Reticulatae Chachi Pericarpium (CRCP), with the help of principal component analysis (PCA) and orthogonal partial least squares discriminant analysis (Duan et al., 2016). The volatile oils and five bioactive flavonoids in CRP collected from different regions were analyzed by GC-MS and high-performance liquid chromatography (HPLC) (Luo et al., 2018).
Headspace-gas chromatography-ion mobility spectrometry (HS-GC-IMS) with PCA was established to discriminate CRCP and CRBP by their volatile organic compounds (Lv et al., 2020).
Ultra-high-performance liquid chromatography quadrupole/time-of-flight mass spectrometry (UHPLC-TOF/MS) is an effective method for the differentiation of CRCP and CRBP, and of CRCP with different storage ages. Thirty-one metabolites, such as aloesone, roseoside, and 7-hydroxy-5,3′,4′-trimethoxyflavone, were identified to distinguish CRCP in different storage ages (Luo et al., 2019). The results show an upward trend in 3-15 years and a downward trend to a stable state in 15-30 years, indicating that CRCP has the characteristics of "the longer it is stored, the better the quality is." However, the chromatographic methods usually require expensive equipment and tedious operation. Besides, samples need to be pretreated in these methods, so nondestructive testing cannot be realized. Low-cost, nondestructive, and accurate methods for the classification of CRPs in different storage ages are still in demand.
Near-infrared diffuse reflectance spectroscopy (NIRDRS) has been widely used in the nondestructive analysis of complex samples in the food (Yu et al., 2009), agriculture (Purcell et al., 2009; Tardaguila et al., 2017), and medicine industries (Li et al., 2012; Liu et al., 2018; Xu et al., 2015). Information on the stretching vibrations of hydrogen-containing functional groups such as C-H, N-H, S-H, and O-H can be obtained with NIRDRS.
However, the useful information of analytes is always embedded in the interference of overlapping and background. A large number of chemometric methods have been developed to solve the problems.
Spectral pretreatment methods have been used for baseline correction and background removal, each with its own advantages and disadvantages (Bian et al., 2016; Han et al., 2017; Li et al., 2020; Ma, Liu, et al., 2020; Ma, Pang, et al., 2020; Shao et al., 2010). De-bias correction and detrend (DT) methods can be used to eliminate the interference of baseline drift. Standard normal variate (SNV) transformation and multiplicative scatter correction (MSC) are used to eliminate the scattering effect caused by uneven particle distribution and particle size. Maximum and minimum normalization (MinMax) is a scaling technique that normalizes all the variables into a certain range (Bian et al., 2020). First-order derivative (1st Der), second-order derivative (2nd Der), and continuous wavelet transform (CWT) can subtract the influence of instrument background or drift on the signal (Bian et al., 2020). However, the noise level increases significantly in the results of higher order derivatives. In addition, a single pretreatment method can only suppress one kind of interference in the spectra, and the optimal pretreatment method usually differs from one dataset to another. To solve this problem, combined pretreatment methods are often used to eliminate the various interferences in the spectra (Bian et al., 2020).
PCA (Li et al., 2012), soft independent modeling of class analogy (SIMCA) (Szabó et al., 2018), and Fisher's linear discriminant analysis (FLD) (Witjes et al., 2003; Yan et al., 2018) have been applied for classification, while partial least squares (PLS), boosting partial least squares (Shao et al., 2010), and related robust techniques (Li et al., 2020; Ma, Liu, et al., 2020; Ma, Pang, et al., 2020; Melssen et al., 2007) were used for quantitative analysis.
Multiple sets of data provide more useful and complementary information than a single set. More and more attention has been paid to the data combination of near- and mid-infrared spectroscopy, Raman spectroscopy, electronic nose, and electronic tongue. For example, the NIRDRS and HPLC data of lotus seed were combined into a new dataset to extract more information and build a reliable model (Guo et al., 2017). With the help of the data combination method, the quantitative prediction of liensinine, rutin, total sugar, and total polysaccharide in lotus seed samples was performed simultaneously and successfully.
Though the analysis of complex food samples by NIRDRS with chemometric methods has been widely reported, research on the identification of CRPs in different storage ages is rare, due to the complexity of the composition of CRP and the small differences in composition between storage ages (Zhou et al., 2015). The aim of this study was to use an NIRDRS instrument and a data combination technique to obtain reliable and accurate identification results for CRPs in different storage ages. The near-infrared spectra of outer skin and inner capsule were obtained directly by the NIRDRS instrument. PCA, SIMCA, and FLD, with different data pretreatment methods, were used for the classification analysis of CRPs in different storage ages.
Data combination of the outer skin and inner capsule spectra was discussed for improving the classification results.
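The scatter-correction and scaling pretreatments named above can be illustrated with a minimal NumPy sketch. It follows the common textbook definitions of SNV and MSC. The row-wise MinMax scaling to [0, 1], the mean spectrum as MSC reference, and the function names are assumptions made for illustration, not details taken from the study.

```python
import numpy as np

def snv(X):
    """Standard normal variate: center and scale each spectrum (row) by its own mean and std."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, ddof=1, keepdims=True)
    return (X - mean) / std

def msc(X, reference=None):
    """Multiplicative scatter correction against a reference (assumed: the mean spectrum)."""
    X = np.asarray(X, dtype=float)
    ref = X.mean(axis=0) if reference is None else np.asarray(reference, dtype=float)
    corrected = np.empty_like(X)
    for i, row in enumerate(X):
        # Least-squares fit: row ≈ intercept + slope * ref
        slope, intercept = np.polyfit(ref, row, 1)
        corrected[i] = (row - intercept) / slope
    return corrected

def minmax(X):
    """Scale each spectrum to the range [0, 1] (row-wise scaling is an assumption)."""
    X = np.asarray(X, dtype=float)
    lo = X.min(axis=1, keepdims=True)
    hi = X.max(axis=1, keepdims=True)
    return (X - lo) / (hi - lo)

# Synthetic demo (not data from the study): one band shape seen with
# three different multiplicative and additive scatter effects.
base = np.sin(np.linspace(0.0, 3.0, 50)) + 2.0
X = np.vstack([base, 2.0 * base + 1.0, 0.5 * base - 0.3])
X_snv = snv(X)
X_msc = msc(X)
X_mm = minmax(X)
```

In this synthetic case the three rows differ only by an offset and a gain, so each pretreatment maps them onto a single curve. Real spectra also differ chemically, and that residual difference is what the classifier uses.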

CRP sample
In this study, CRPs in different storage ages (5, 10, 15, 20, and 25 years) were collected from Guangdong Fu Dong Hai Co., Ltd, and 40 samples were taken from each age-group. Therefore, 200 CRP samples were analyzed.

Instrumentation and measurements
All spectra were obtained by an MPA spectrometer (Bruker Optics Inc.) in diffuse reflectance mode with an integrating sphere diffuse reflection accessory (Bruker Optics Inc.). Each CRP is composed of three petals of pericarp (~50 mm diameter), and one intact petal from each CRP was placed directly at the center of the light spot without a container. Each spectrum is composed of 2,204 data points recorded from 12,000 to 3,500 cm⁻¹. The measurements were repeated three times and averaged.
Figure 2. PCA results with different data: raw data (a) and data with MSC pretreatment (b) for the outer skin, and raw data (c) and data with MSC pretreatment (d) for the inner capsule.
... creation of the models. For the FLD method, the total number of objects should be at least three to five times the number of variables. Therefore, for classification by FLD modeling, PCA was applied to reduce the dimensionality to a smaller number of principal components (PCs). A maximum of 30 PCs were selected, which explained most of the variation (~99%) (Wang et al., 2013).
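The dimension reduction step can be sketched as follows. With 200 samples and 2,204 variables per spectrum, the objects-to-variables rule stated above cannot be met on the raw spectra, which is why PC scores are used instead. This is a minimal sketch, assuming PCA by singular value decomposition of the column-centered data. The function name `pca_scores` and its parameters are hypothetical. The 99% target and the cap of 30 PCs mirror the figures quoted in the text.

```python
import numpy as np

def pca_scores(X, var_target=0.99, max_pcs=30):
    """Return PC scores and cumulative explained variance.

    Keeps the smallest number of PCs whose cumulative explained variance
    reaches var_target, capped at max_pcs.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # column-center the spectra
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    cumulative = np.cumsum(s**2) / np.sum(s**2)  # cumulative explained variance
    k = int(np.searchsorted(cumulative, var_target)) + 1
    k = min(k, max_pcs, len(s))
    scores = U[:, :k] * s[:k]                    # sample coordinates on the PCs
    return scores, cumulative[:k]

# Synthetic demo (not data from the study): 200 samples driven by 3 latent factors.
rng = np.random.default_rng(1)
latent = rng.normal(size=(200, 3))
loadings = rng.normal(size=(3, 50))
X_demo = latent @ loadings + 1e-3 * rng.normal(size=(200, 50))
scores_demo, cum_demo = pca_scores(X_demo)
```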

Data analysis
Multiple sets of data may provide more useful and complementary information than a single set, and more and more attention has been paid to data combination. Many reports indicate that better classification results can be obtained when data measured by different analytical techniques, such as near- and mid-infrared spectroscopy, Raman spectroscopy, electronic nose, and electronic tongue, are combined (Guo et al., 2017; Zhuang et al., 2014). The appearances and compositions of outer skin and inner capsule are different. Therefore, in this paper, data combination was used to improve the accuracy of the classification results in both the SIMCA and FLD methods. In the SIMCA method, the matrices of the outer skin spectra (n × m) and inner capsule spectra (n × m) were combined into a new matrix (n × 2m), which served as the combination data.
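The spectrum-level combination used for SIMCA amounts to a column-wise concatenation, as in the short sketch below. The helper name `combine_spectra` is hypothetical, and the arrays are random placeholders rather than measured spectra. With m = 2,204 points per spectrum, the combined matrix would have 2m = 4,408 columns.

```python
import numpy as np

def combine_spectra(outer, inner):
    """Concatenate outer-skin and inner-capsule spectra sample by sample.

    Row i of both inputs must belong to the same CRP sample.
    """
    outer = np.asarray(outer, dtype=float)
    inner = np.asarray(inner, dtype=float)
    if outer.shape[0] != inner.shape[0]:
        raise ValueError("both blocks need one row per sample, in the same order")
    return np.hstack([outer, inner])  # (n, m) and (n, m) -> (n, 2m)

# Placeholder data: n = 4 samples, m = 6 variables per block.
rng = np.random.default_rng(0)
outer_demo = rng.random((4, 6))
inner_demo = rng.random((4, 6))
combined_demo = combine_spectra(outer_demo, inner_demo)
```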
In the FLD method, PCA was applied to the processing of spectral data, which reduced the multidimensionality into fewer PCs. For both outer skin and inner capsule data, most variations (~99%) can be explained with the PC number 30. Therefore, 30 PCs were selected. Scores of these PCs of outer skin and inner capsule were combined as the combination data.
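For FLD, the combination described above happens at the score level: each block is reduced to 30 PCs and the two score matrices are joined, giving 60 variables per sample. The sketch below assumes that the PCA loadings are fitted on a calibration set and then applied to new samples. That split, and all function names, are assumptions for illustration, since the source does not describe how the projection was organized.

```python
import numpy as np

def fit_pca(X_cal, n_pcs):
    """Fit PCA on calibration spectra; return the column mean and the loadings."""
    X_cal = np.asarray(X_cal, dtype=float)
    mean = X_cal.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_cal - mean, full_matrices=False)
    return mean, Vt[:n_pcs].T                    # loadings: (variables, n_pcs)

def project(X, mean, loadings):
    """Project spectra onto previously fitted loadings."""
    return (np.asarray(X, dtype=float) - mean) @ loadings

def combined_scores(outer_cal, inner_cal, outer_new, inner_new, n_pcs=30):
    """Score-level combination: join the PC scores of both blocks."""
    cal_blocks, new_blocks = [], []
    for X_cal, X_new in ((outer_cal, outer_new), (inner_cal, inner_new)):
        mean, loadings = fit_pca(X_cal, n_pcs)
        cal_blocks.append(project(X_cal, mean, loadings))
        new_blocks.append(project(X_new, mean, loadings))
    return np.hstack(cal_blocks), np.hstack(new_blocks)

# Placeholder data: 40 calibration and 20 new samples, 50 variables per block.
rng = np.random.default_rng(0)
cal_demo, new_demo = combined_scores(
    rng.normal(size=(40, 50)), rng.normal(size=(40, 50)),
    rng.normal(size=(20, 50)), rng.normal(size=(20, 50)),
)
```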
The programs were performed using MATLAB 8.3 (MathWorks, USA) and run on a personal computer. Figure 1 shows the original spectra of outer skin and inner capsule.

NIRDRS spectra of CRP samples in different storage ages
There is very serious background interference in the spectra of both outer skin and inner capsule. The drifting baselines, caused by the uneven and rough surface of the CRP, affect the accuracy of the result.
In order to eliminate background interference and obtain reliable classification models, the classification models were established by the SIMCA algorithm with different pretreatment techniques. Table 1 shows the classification accuracies obtained by SIMCA and different pretreatment methods. It is obvious that the classification accuracies with the outer skin spectra are higher than those with the inner capsule spectra. The classification accuracies were improved by using the single pretreatment methods, compared with the results with the raw data. For the analysis of outer skin data, de-bias and MinMax are the best pretreatment methods, and the classification accuracies for the whole data are both 79.10%. However, due to the increase of noise level in higher order derivative calculation, the result with the 2nd Der method is not satisfactory: the classification accuracies of 10 and 15 years are 0% with this method. For the analysis of inner capsule data, CWT is the best pretreatment method, and the classification accuracy for the whole data is 74.63%. The classification accuracies of 15 years are the worst for both outer skin and inner capsule data. This is consistent with the change of metabolites in CRPs in different storage ages (Luo et al., 2019). Besides, compared with the results of single pretreatment methods, the classification accuracies were not improved by using the combined pretreatment methods. Useful information is lost when too many pretreatment methods are applied. It seems difficult to perform the classification using the SIMCA method, even when different pretreatment methods are adopted.
Table 2. Classification accuracies obtained by FLD and different pretreatment methods. Columns: dataset; pretreatment method; accuracy (%) for 5, 10, 15, 20, and 25 years and for the whole data.
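The noise amplification attributed above to higher order derivatives can be checked numerically. The sketch below uses plain finite differences on white noise, which is an assumption and may not be the derivative algorithm used in the study. For independent noise of variance σ², the first difference x[i+1] − x[i] has variance 2σ², and the second difference x[i+2] − 2x[i+1] + x[i] has variance (1 + 4 + 1)σ² = 6σ².

```python
import numpy as np

# Synthetic white noise with unit variance (not data from the study).
rng = np.random.default_rng(0)
noise = rng.normal(0.0, 1.0, size=200_000)

first_diff = np.diff(noise, n=1)    # x[i+1] - x[i]
second_diff = np.diff(noise, n=2)   # x[i+2] - 2*x[i+1] + x[i]

# Variance ratios relative to the raw noise; theory gives 2 and 6.
ratio_first = first_diff.var() / noise.var()
ratio_second = second_diff.var() / noise.var()
print(f"1st difference: {ratio_first:.2f}x, 2nd difference: {ratio_second:.2f}x")
```

A broad spectral band changes slowly from point to point, so differencing attenuates it while the noise grows. That is consistent with the weak 2nd Der results reported for SIMCA on single spectral data.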

Classification of single spectral data with FLD and different pretreatment techniques
FLD is a powerful supervised classification method, which is used to find the optimal boundary between object classes. Therefore, in this study, the FLD method was used for the classification analysis of CRP samples. The results of inner capsule and outer skin data were compared. Different pretreatment techniques were applied to further optimize the classification model.
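A compact multi-class Fisher discriminant is sketched below to make the idea concrete. It finds directions that maximize between-class scatter relative to within-class scatter and assigns a sample to the nearest class centroid in the projected space. The nearest-centroid rule, the Cholesky-based solution, and the function names are assumptions for illustration, since the source does not give these details. The within-class scatter matrix must be invertible here, which requires the objects to outnumber the variables.

```python
import numpy as np

def fit_fld(X, y):
    """Fit Fisher discriminant directions and projected class centroids."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    p = X.shape[1]
    overall = X.mean(axis=0)
    Sw = np.zeros((p, p))                       # within-class scatter
    Sb = np.zeros((p, p))                       # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        Sw += (Xc - mu).T @ (Xc - mu)
        d = (mu - overall)[:, None]
        Sb += len(Xc) * (d @ d.T)
    # Whiten with Sw = L L^T, then solve a symmetric eigenproblem.
    L_inv = np.linalg.inv(np.linalg.cholesky(Sw))
    vals, vecs = np.linalg.eigh(L_inv @ Sb @ L_inv.T)   # ascending eigenvalues
    k = len(classes) - 1                        # at most C - 1 useful directions
    W = L_inv.T @ vecs[:, ::-1][:, :k]
    centroids = np.array([(X[y == c] @ W).mean(axis=0) for c in classes])
    return W, classes, centroids

def predict_fld(X, W, classes, centroids):
    """Assign each sample to the nearest centroid in discriminant space."""
    Z = np.asarray(X, dtype=float) @ W
    dists = np.linalg.norm(Z[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(dists, axis=1)]

# Synthetic demo (not data from the study): three well-separated groups.
rng = np.random.default_rng(0)
means = np.array([[0, 0, 0, 0], [10, 10, 0, 0], [0, 10, 10, 0]], dtype=float)
idx = np.repeat([0, 1, 2], 30)
labels = np.array([5, 10, 15])[idx]             # storage age used as class label
X_train = means[idx] + rng.normal(size=(90, 4))
X_test = means[idx] + rng.normal(size=(90, 4))
W, classes, centroids = fit_fld(X_train, labels)
pred = predict_fld(X_test, W, classes, centroids)
```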

Classification of the combination data with SIMCA and different pretreatment techniques
Multiple sensors may provide more useful and complementary information than a single sensor does for improving the prediction results (Guo et al., 2017; Zhuang et al., 2014). Therefore, a data combination method was developed to improve the accuracy of classification results with the SIMCA method. Different pretreatment techniques were used to optimize the classification model. Table 3 shows the classification accuracies of the combination data with SIMCA and different pretreatment methods. Compared with the results of single outer skin and inner capsule spectra, the classification accuracies with the SIMCA method have been further improved. The accuracies of identification are more than 80% with the de-bias, SNV, and MinMax methods. De-bias is the best pretreatment method, and the classification accuracy for the whole data is 85.07%. However, the results are still unsatisfactory, especially for the identification of 15-year CRP.
Figure 3. FLD score plot for discrimination of CRPs in different storage ages with second-order derivative-FLD (a) and CWT-MSC-FLD methods (b).

Classification of the combination data with FLD and different pretreatment techniques
Data combination with FLD and different pretreatment techniques was used to get accurate classification results. Table 3 shows the classification accuracies of the combination data obtained by FLD and different pretreatment methods. The classification accuracies with the FLD method are further improved with the combination data, compared with the results of outer skin and inner capsule spectra.
The identification accuracy of combination data is more than 94.03% even without spectral pretreatment, while the accuracy of identification is 91.04% with the single spectral data. Spectral pretreatment methods can further optimize the classification model. Data combination models based on second-order derivative-FLD and continuous wavelet transform-multiplicative scatter correction-FLD obtained the best results, with 100% prediction accuracy. Furthermore, Figure 3 shows the FLD score plot for discrimination of CRPs in different storage ages with the second-order derivative-FLD and CWT-MSC-FLD methods, in which all five groups are visually separated. The results demonstrate that the classification of CRP in different storage ages can be achieved by data combination and appropriate chemometric methods.
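The per-age and whole-data accuracies quoted in this section are simple proportions of correctly assigned samples, as sketched below. The function name is hypothetical. As a side observation, the whole-data percentages reported above (74.63, 79.10, 85.07, 91.04, and 94.03%) are each consistent with a whole number of correct predictions out of 67 samples. The source does not state the size of the prediction set, so this is an inference and not a reported fact.

```python
import numpy as np

def accuracies(y_true, y_pred):
    """Per-class and overall classification accuracy, in percent."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    per_class = {
        int(c): float(np.mean(y_pred[y_true == c] == c)) * 100.0
        for c in np.unique(y_true)
    }
    overall = float(np.mean(y_pred == y_true)) * 100.0
    return per_class, overall

# Toy labels (storage age in years), not results from the study.
per_class_demo, overall_demo = accuracies([5, 5, 10, 10], [5, 10, 10, 10])

# Inference only: percentages obtained if the prediction set held 67 samples.
implied = [round(100.0 * correct / 67, 2) for correct in (50, 53, 57, 61, 63)]
```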

CONFLICT OF INTEREST
The authors declared that they have no conflicts of interest to this work.