Species discrimination and total polyphenol prediction of porcini mushrooms by Fourier transform mid‐infrared (FT‐MIR) spectrometry combined with multivariate statistical analysis

Abstract Wild porcini mushrooms, a specialty agricultural product of the plateau, are highly valued both as a delicacy and as a potential medication. Quality differences between species, together with fraud during sale, allow poor-quality or poisonous samples to enter the market, which poses a health risk to consumers and disrupts the mushroom market. Traditional analytical methods are time‐consuming and laborious. The aim of this study was therefore to develop a method based on Fourier transform mid‐infrared (FT‐MIR) spectrometry and data fusion strategies for fast and accurate species discrimination and for predicting the amount of total polyphenol in four porcini mushrooms. The t‐distributed stochastic neighbor embedding based on mid‐level data fusion showed that two species, Boletus edulis and B. umbriniporus, could be identified. The correct rates of the PLS‐DA models ranked as mid‐level data fusion q (100%) > mid‐level data fusion e (97.06%) = mid‐level data fusion v (97.06%) = stipes (97.06%) > low‐level data fusion (94.12%) > caps (91.18%). The correct rates of the grid‐search support vector machine models ranked as low‐level data fusion (100%) > caps (94.12%) > stipes (91.18%), and those of the particle swarm optimization support vector machine models ranked as low‐level data fusion (100%) > caps (97.06%) > stipes (88.24%). Mid‐level data fusion q and low‐level data fusion had the best discrimination accuracy (100%), with each mushroom classified into its real species, and could be used for accurate discrimination of samples. B. edulis mushrooms had the highest total polyphenol content, with 14.76 mg/g dw in caps and 17.33 mg/g dw in stipes. Phenols accumulated more readily in the caps of Leccinum rugosiceps (1.03) and B. tomentipes (1.19), and the opposite was observed in B. edulis (0.85) and B. umbriniporus (0.95). The correlation coefficient and residual predictive deviation of the best prediction model were 86.76% and 2.40, respectively, indicating a good relationship between FT‐MIR data and total polyphenol content, which could be used to predict the polyphenol content of mushrooms roughly.

They also contain many mineral compounds, including the valued selenium with its high antioxidant properties, although this can depend on the species and its place of origin (Bekyarov et al., 2011). Polyphenols are compounds with a positive effect on human health (Wong & Chye, 2009) and are widely found in vegetables, fruits, and grain products (Velioglu, Mazza, Gao, & Oomah, 1998). Studies have revealed that polyphenols from mushrooms have higher antioxidant activities than those from many fruits and vegetables (Ismail, Marjan, & Foong, 2004; Jia, Tang, & Wu, 1999; Velioglu et al., 1998).
Approximately 224 species of porcini mushrooms occur in Yunnan Province, of which 114 are edible. Porcini mushrooms are one of the "four kings of mushroom" and one of the traditional bulk agricultural exports. According to statistical data, wild porcini mushrooms exported from Yunnan Province account for about 80% of the total export quantity of China (http://ynsyj.yncoop.com). According to a market investigation in August 2019, the average price of hot-selling porcini mushrooms can reach about 90 CNY per kilogram. Because of this high economic value, mushrooms can be exploited by unscrupulous sellers seeking profit, and some cases of fraud have been reported.
Several mushrooms without formal scientific names have been sold together with mushrooms from porcini species (Dentinger & Suz, 2014). Fraud was also reported by Casale et al. (2012), who revealed that commercial dried mushrooms composed of Boletus edulis and related species had been adulterated with B. violaceofuscus. These actions seriously affect mushroom quality and interfere with quality supervision, resource evaluation, food safety, and even human health. In addition, only one species, Phlebopus portentosus, can be cultivated successfully, and over-picking by lower-income local people makes seasonal wild porcini mushrooms increasingly rare. These reasons confirm the urgent need for species discrimination of porcini mushrooms.
Morphological identification is a common method based on surface features of the fruiting body such as color, shape, size, or reticulate pattern (Tsujikawa et al., 2003). However, this method neglects the phenotypic variability of mushrooms. Moreover, the fruiting body loses its characteristics when it is dried, dehydrated, salted, or bleached, and it is usually processed into slices, cans, or extracts to withstand long-distance transportation for export. Molecular tools have been developed to overcome these flaws. Internal transcribed spacer (ITS) primers applied to 28 B. edulis samples, combined with phylogenetic tree analysis, provided effective identification and revealed that B. edulis mushrooms are often sold mixed with mushrooms of B. violaceofuscus (Mello et al., 2006). Another study on species clarification used ITS and fungal immunomodulatory protein (FIP) sequences (Zhou, Liu, Guo, Su, & Zhang, 2016) and showed that mushrooms from Ganoderma actually belong to different species. Nevertheless, researchers now prefer infrared spectroscopy because this analytical technique offers a wide range of applications (liquid, solid, and gaseous samples), low sample consumption, nondestructive measurement, and speed, and it serves not only for compound identification and molecular structure characterization but also for quantitative analysis (Borràs et al., 2015; Dutta, 2017). Just as there are no two identical leaves in the world, there are also differences between samples, which can be displayed in the infrared spectrum.
Influenced by latitude, sea buckthorn berries collected from Québec, Canada, may have a more pleasant taste than those grown at Sammalmäki, Finland, owing to higher levels of total sugar and sugar/acid ratio and a lower level of total acid (Zheng, Yang, Trépanier, & Kallio, 2012). It is this variability between samples that is key to building a spectroscopic fingerprint and makes it possible to track the sample source. Fingerprint spectra from Fourier transform mid-infrared (FT-MIR) spectroscopy have been combined with chemometrics in food quality applications such as tannin characterization (Tondi & Petutschnigg, 2015), honey adulteration (Das et al., 2017), camel milk metamorphism (Nagy et al., 2019), content determination of caffeine and trigonelline (Hagos, Redi-Abshiro, & Chandravanshi, 2018), variant screening of short-chain fructooligosaccharides (scFOS) (Trollope, Nieuwoudt, Görgens, & Volschenk, 2014), and storage effects on saffron (Ordoudi, de los Mozos Pascual, & Tsimidou, 2014). Despite the excellent practicability of FT-MIR spectroscopy, a potential flaw of a single data matrix is that it is not enough to represent the entire sample (Borràs et al., 2015).
Data fusion is an emerging technology at the intersection of multiple disciplines that aims at a satisfactory inference. Thanks to the rapid development of computers, the improvement of experimental instruments, and its own gradual maturation, data fusion has been broadly used in many areas including environmental supervision, food quality, medical diagnosis, the military, mapping, and robotics (Hall & Llinas, 2001).
Low-level data fusion is an effective strategy to distinguish official rhubarb (recorded in the Chinese Pharmacopoeia) from unofficial rhubarb (Sun, Zhang, Zhang, & Zhu, 2017) and to trace the geographical origin of the herbal medicine Panax notoginseng (Li, Zhang, & Wang, 2018).
Another commonly used strategy is mid-level data fusion. Unlike the direct splicing of data in low-level data fusion, mid-level data fusion extracts the effective information from single data matrices in order to speed up the algorithm and produce a better outcome. Mid-level data fusion has been used to trace the species and geographical origin of porcini mushrooms (Yao, Li, Liu, Li, & Wang, 2018), to classify organic and nonorganic orange juices (Cuevas, Pereira-Caro, Moreno-Rojas, Muñoz-Redondo, & Ruiz-Moreno, 2017), and to discriminate between manufacturers of beer of the same brand and product (Vera et al., 2011). According to the literature, the common ways of feature selection are as follows: (1) latent variables (LVs), selected according to R²Y(cum) and Q²(cum) of a partial least squares-discriminant analysis (PLS-DA) model (Yao et al., 2018); (2) principal components (PCs) obtained from principal component analysis (PCA) (Vera et al., 2011); and (3) variables whose variable importance in the projection (VIP) value is >1 (Qi, Liu, Li, Li, & Wang, 2018). R²Y(cum) and Q²(cum) are used to assess the ability of the model to fit the data of the training set and to predict new samples (test set), respectively. R²Y(cum) represents the degree of fit; a large value (close to 1) is usually a necessary condition for a good model. Q²(cum) represents the predictive ability of the model; a large value (>0.5) indicates good potential to predict sample origin. VIP values larger than 1 indicate "important" X-variables, while values lower than 0.5 indicate "unimportant" X-variables.
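The difference between the two fusion levels can be illustrated with a minimal Python sketch. It assumes sample-aligned cap and stipe matrices and a set of VIP scores per variable; the toy numbers, the VIP values, and the function names are hypothetical and are not taken from the study, in which VIP scores would come from a fitted PLS-DA model.

```python
# Hypothetical toy data: 3 samples x 4 spectral variables per morphological part.
caps = [
    [0.10, 0.20, 0.30, 0.40],
    [0.11, 0.19, 0.33, 0.41],
    [0.09, 0.22, 0.29, 0.38],
]
stipes = [
    [0.50, 0.20, 0.70, 0.40],
    [0.52, 0.21, 0.68, 0.43],
    [0.49, 0.18, 0.71, 0.39],
]
# Hypothetical VIP scores, one per variable (assumed to come from a PLS-DA model).
vip_caps = [1.30, 0.40, 1.05, 0.80]
vip_stipes = [0.60, 1.20, 0.95, 1.10]


def low_level_fusion(a, b):
    """Splice two sample-aligned matrices side by side (row-wise concatenation)."""
    if len(a) != len(b):
        raise ValueError("both matrices must describe the same samples")
    return [row_a + row_b for row_a, row_b in zip(a, b)]


def select_by_vip(matrix, vip, threshold=1.0):
    """Keep only the columns whose VIP score exceeds the threshold."""
    keep = [j for j, score in enumerate(vip) if score > threshold]
    return [[row[j] for j in keep] for row in matrix]


def mid_level_fusion_vip(a, vip_a, b, vip_b, threshold=1.0):
    """Select features in each block first, then splice the reduced blocks."""
    return low_level_fusion(
        select_by_vip(a, vip_a, threshold),
        select_by_vip(b, vip_b, threshold),
    )


low = low_level_fusion(caps, stipes)                            # 8 variables per sample
mid = mid_level_fusion_vip(caps, vip_caps, stipes, vip_stipes)  # 4 variables per sample
```

The same pattern would apply to PCs or LVs: only the selection step changes, while the splicing step stays the same.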
Although there are many studies on species discrimination using data fusion strategies, there is not enough information on data fusion using different morphological parts of porcini mushrooms from Yunnan Province. The current study aims to find a fast and simple way to discriminate species and predict polyphenol content using Fourier transform mid-infrared (FT-MIR) spectra from different morphological parts of the mushroom.
Table 1. The mushrooms were picked in forests. In other words, the collection locations were in mountain areas far away from villages and human activities. Therefore, these mushrooms could be considered clean and undamaged. Robust fruiting bodies of similar size were selected for analysis.

| Sampling and sample preparation
The fresh fruiting bodies were cleaned with a soft brush to remove soil and other litter (leaves or branches) and then washed with running tap water until no impurities were visible to the naked eye. The wet mushrooms were dried in a laboratory oven at 50°C to constant mass and crushed immediately. Each dried fruiting body was divided into two morphological parts, cap and stipe, before crushing. Finally, the crushed sample was passed through an 80-mesh sieve and stored in a zip lock bag until analysis.

| Determining FT-MIR information
Spectral information was obtained with the common KBr tableting method. In the first step, 1.0 ± 0.2 mg of sample was homogenized with 100 ± 0.2 mg of KBr powder in an agate mortar. Because KBr readily absorbs moisture, this step was completed under infrared light. After thorough homogenization, the mixture was pressed into a thin slice with a tablet press. In the last step, the FT-MIR spectra were determined immediately with a Fourier transform infrared spectrometer equipped with a DTGS detector. The wavenumber range, number of successive scans, and resolution were set to 4,000 to 400 cm⁻¹, 64, and 4 cm⁻¹, respectively.

| Determining amount of total polyphenol
An accurately weighed 0.2500 g sample was extracted in an ultrasonic cleaning machine with 5 ml (solid-liquid ratio of 1:20) of 40% ethanol at room temperature for 30 min. The extract was then filtered through qualitative filter paper, and the filtrate was made up to 25 ml with deionized water.
The visible absorbance was determined with a method similar to the procedure reported by Kaewnarin et al. (2016). A 0.1 ml aliquot of filtrate was transferred to a test tube with a pipette and added to 7.9 ml of distilled water for an 80-fold dilution. After that, 0.5 ml of 1 mol/L Folin-Ciocalteu reagent was added to the sample.
Eight minutes later, 1.5 ml of 20% saturated sodium carbonate solution was added to the diluted sample. After sealing with preservative film, the sample was incubated for 1 hr at room temperature and then measured with a visible spectrophotometer at 760 nm. The calibration curve for quantification was constructed with gallic acid. All reagents were of analytical reagent (AR) grade.
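The arithmetic behind this assay can be sketched in Python as follows. The sketch assumes a linear gallic acid calibration of the form A = slope × c + intercept, with c expressed in mg/ml for the diluted solution; the slope, intercept, absorbance, and function names are hypothetical illustrations, since the actual calibration curve is not given in the text.

```python
def dilution_factor(aliquot_ml, diluent_ml):
    """Fold dilution obtained by adding an aliquot to a volume of diluent."""
    return (aliquot_ml + diluent_ml) / aliquot_ml


def gallic_acid_conc(absorbance, slope, intercept):
    """Invert a linear calibration A = slope * c + intercept (assumed form)."""
    return (absorbance - intercept) / slope


def total_polyphenol_mg_per_g(absorbance, slope, intercept, dilution,
                              filtrate_ml=25.0, sample_g=0.2500):
    """Gallic acid equivalents per gram of dry sample.

    Assumption: the calibration yields mg/ml in the diluted solution, so the
    value is scaled back by the dilution and by the 25 ml filtrate volume.
    """
    c_diluted = gallic_acid_conc(absorbance, slope, intercept)
    c_filtrate = c_diluted * dilution
    return c_filtrate * filtrate_ml / sample_g


# 0.1 ml of filtrate in 7.9 ml of water, as described in the procedure.
fold = dilution_factor(0.1, 7.9)

# Hypothetical reading and calibration (slope in absorbance per mg/ml).
content = total_polyphenol_mg_per_g(absorbance=0.15, slope=80.0,
                                    intercept=0.0, dilution=fold)
```

With these made-up numbers the dilution is 80-fold and the content comes to about 15 mg/g dw; real values depend entirely on the measured calibration curve.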

| Statistical analyses
Spectral information was converted to data matrices with the software Omnic 8.0 (Thermo Fisher Scientific Inc.). For low-level data fusion, the data matrices from the two parts were spliced directly. Three types of feature variables were used, namely PCs, LVs, and VIPs, with the selection criteria eigenvalue >1, maximum Q²(cum), and VIP >1, respectively. Feature matrices obtained by the same extraction method were spliced to give three new data matrices for mid-level data fusion. The t-distributed stochastic neighbor embedding (t-SNE) was applied for a preliminary visualization of mushroom discrimination. To build the discrimination models, six data matrices (two single, one low-level data fusion, and three mid-level data fusion) were divided into a training set and a test set with the Kennard-Stone (KS) algorithm. The training set, containing 2/3 of the data, was used to build the model, while the remaining data, the test set, were used to test the discriminatory capacity of the model. The supervised classification methods PLS-DA, grid-search support vector machine (GS-SVM), and particle swarm optimization support vector machine (PSO-SVM) were developed with the software SIMCA-P+ 13.0 (Umetrics AB) and MATLAB R2014a (MathWorks).
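As an illustration of the sample split, the Kennard-Stone selection can be sketched in plain Python. This is a generic textbook version using Euclidean distance, not the authors' MATLAB code; the function names and the one-dimensional toy samples are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kennard_stone(samples, n_train):
    """Return (training indices, test indices) chosen by Kennard-Stone.

    The two most distant samples are selected first; after that, the sample
    whose nearest selected neighbour is farthest away is added each round.
    """
    n = len(samples)
    if not 2 <= n_train <= n:
        raise ValueError("n_train must be between 2 and the number of samples")
    dist = [[euclidean(samples[i], samples[j]) for j in range(n)] for i in range(n)]

    # Seed the training set with the most distant pair.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [first, second]
    remaining = [i for i in range(n) if i not in selected]

    while len(selected) < n_train:
        # Max-min criterion: pick the candidate farthest from its closest selected sample.
        best = max(remaining, key=lambda i: min(dist[i][s] for s in selected))
        selected.append(best)
        remaining.remove(best)
    return selected, remaining


# Hypothetical one-variable samples; 4 of 6 (2/3) go to the training set.
toy = [[0.0], [1.0], [2.0], [9.0], [10.0], [5.0]]
train_idx, test_idx = kennard_stone(toy, n_train=4)
```

Because the selection favours samples at the edges of the data cloud, the training set tends to span the variability of the whole data set, which is the usual reason for preferring it to a random split.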
In the assay for total polyphenol prediction, the amount and extraction rate of total polyphenol of each sample were calculated from the visible absorbance. The amount data and the FT-MIR data from the same morphological part were then spliced to give two new data matrices for total polyphenol prediction. The spectral data were optimized with first-order derivative (FD), second-order derivative (SD), multiplicative scattering correction (MSC), standard normal variate (SNV), Savitzky-Golay (SG) smoothing, and combinations of these treatments. The derivative order, number of points, and distance between points were set to 1, 15, and 1 for FD and to 2, 15, and 1 for SD, respectively. Derivatives are powerful for enhancing resolution, but they also enhance the noise (Roy, 2015). SG smoothing may help to solve this problem (Xu et al., 2008). MSC and SNV are usually applied to reduce the light scatter effect caused by the sizes and shapes of granular samples (Helland, Naes, & Isaksson, 1995). Before KS, the data matrices were normalized to a range of 1-2 to avoid the loss of accuracy caused by different dimensions. The KS method was as described above. Total polyphenol was predicted with the supervised regression method GS-SVM. Root mean square error of cross validation (RMSECV) and root mean square error of prediction (RMSEP) were used to estimate the total error for samples and the total prediction error, respectively. They were calculated as follows:
$$\mathrm{RMSECV} = \sqrt{\frac{\sum_{i=1}^{n_c} (\hat{y}_i - y_i)^2}{n_c}}, \qquad \mathrm{RMSEP} = \sqrt{\frac{\sum_{i=1}^{n_p} (\hat{y}_i - y_i)^2}{n_p}}$$
where n_c is the number of samples in the training set, n_p is the number of samples in the test set, and ŷ_i and y_i are the predicted and measured values, respectively.
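Two of the treatments mentioned above, SNV and the normalization to a range of 1-2, can be sketched in plain Python. The sketch assumes that SNV is applied to each spectrum separately using the sample standard deviation and that the range scaling is applied to each variable separately; the text does not state these details, and the function names are hypothetical.

```python
import statistics


def snv(spectrum):
    """Standard normal variate: centre one spectrum and scale it by its own SD."""
    mean = statistics.mean(spectrum)
    sd = statistics.stdev(spectrum)  # sample SD (n - 1), an assumed convention
    return [(x - mean) / sd for x in spectrum]


def scale_to_range(column, low=1.0, high=2.0):
    """Min-max scale one variable (column) to the interval [low, high]."""
    cmin, cmax = min(column), max(column)
    if cmax == cmin:
        # A constant variable carries no information; map it to the lower bound.
        return [low for _ in column]
    return [low + (high - low) * (x - cmin) / (cmax - cmin) for x in column]


corrected = snv([1.0, 2.0, 3.0])
scaled = scale_to_range([0.0, 5.0, 10.0])
```

SNV removes per-spectrum offset and scale differences, whereas the range scaling puts all variables on a comparable footing before the samples are split and modelled.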
The coefficient of determination (R²) was used to reflect the correlation between predicted and measured values. Another useful assessment parameter, the residual predictive deviation (RPD), was calculated as the standard deviation (SD) divided by RMSEP, that is, RPD = SD/RMSEP. An RPD value between 1.5 and 2 indicates that the model can discriminate low values from high values of the content. An RPD between 2 and 2.5 is necessary for rough prediction, while an RPD greater than 2.5 or 3 corresponds to excellent power for content prediction (Nicolaï et al., 2007). It should be noted that the content data and the spectra were in one-to-one correspondence in total polyphenol prediction. For a better understanding of this study, the workflow is shown in Figure 1.
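The error and RPD measures described here can be computed as in the following Python sketch. It assumes that SD is the sample standard deviation of the measured values; the label for RPD values below 1.5 is an added placeholder because that range is not discussed in the text, and all function names are hypothetical.

```python
import math
import statistics


def rmse(measured, predicted):
    """Root mean square error between measured and predicted values."""
    squared = [(p - m) ** 2 for m, p in zip(measured, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def rpd(measured, predicted):
    """Residual predictive deviation: SD of the measured values over the RMSE."""
    return statistics.stdev(measured) / rmse(measured, predicted)


def interpret_rpd(value):
    """Map an RPD value to the ranges quoted from Nicolaï et al. (2007)."""
    if value > 2.5:
        return "excellent prediction"
    if value >= 2.0:
        return "rough prediction"
    if value >= 1.5:
        return "discriminates low from high values"
    return "below the ranges discussed"  # placeholder, not from the source


# Hypothetical measured and predicted contents, for illustration only.
example_rmse = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
example_rpd = rpd([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

Applied to values reported later in the text, `interpret_rpd(2.4)` falls in the rough prediction range and `interpret_rpd(1.78)` in the range that separates low from high values.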

| FT-MIR absorption peaks interpretation
The raw average spectra of caps and stipes from the four species of porcini mushrooms are shown in Figure 2. The major absorption peaks of caps and stipes have been interpreted. Figure 2 shows that the absorption bands at 2,928, 1,313, 1,082, and 1,025 cm⁻¹ were shared by caps and stipes. The single peak at 2,928 cm⁻¹ could be assigned to methylene C-H asymmetric stretching; the weak peak around 1,313 cm⁻¹ was the vinylidene C-H in-plane bend (Coates, 2000). The peaks around 1,025-1,082 cm⁻¹ corresponded to C-O stretching (Yang & Irudayaraj, 2000). Two

| Visualizing discrimination by t-SNE
PCA and t-SNE are the most common approaches for presenting classification results in a vivid and easy-to-understand way. Both methods can be used to draw two- or three-dimensional graphics that visualize sample classification. The difference is that t-SNE can capture more information for the graphics by using random walks on neighborhood graphs. Studies have reported that t-SNE outperforms other advanced techniques such as Sammon mapping, Isomap, and locally linear embedding (Maaten & Hinton, 2008).
Here, species discrimination by t-SNE is shown in Figure 3, where the dotted circles represent the 95% confidence level of each species. With the single data matrices, the mushroom classifications were disordered and no species could be separated (Figure 3a,b). Low-level data fusion was superior to the single data matrices (Figure 3c). The samples show a clustering tendency, and the samples of B. edulis cluster clearly, although one sample lies beyond the confidence interval and far away from the other samples of the species. Mid-level data fusion e shows a better result, with all B. edulis samples clustered together within the confidence interval but still mixed with samples of other species (Figure 3d1). Compared with mid-level data fusion e, mid-level data fusion v gave better discrimination, in which one species, B. edulis, was separated completely (Figure 3d2). Mid-level data fusion q gave the best clustering among all scatterplots and allowed two species, B. edulis and B. umbriniporus, to be discriminated (Figure 3d3). In addition, the clustering results reveal that the difference within a species is smaller than that between species, since clustering was better within the same species than across different species. To obtain a higher discrimination result, in which each sample is classified into its actual species, supervised models were combined with data fusion.

FIGURE 1 Workflow of species discrimination and total polyphenol prediction

| Discriminating samples by PLS-DA model
PLS-DA is an established general strategy for variable reduction, regression prediction, source discrimination, etc. (Loong, Liong, & Jemain, 2018). Table 2 shows both the model parameters and the discrimination accuracy for the test set. RMSEE and RMSECV were used to assess the reliability of the training set and test set, respectively; a small value is usually necessary for model quality. Compared with the caps model, the stipes model performed better, with R²Y(cum), Q²(cum), RMSEE, and RMSECV of 0.9, 0.7, 0.15, and 0.26, respectively, and its total correct rate for the test set (97.06%) was higher than that of caps (91.18%). Among the data fusion strategies, the same total accuracy of 97.06% was observed for mid-level data fusion e and mid-level data fusion v. Two samples of L. rugosiceps were misclassified as B. tomentipes, in both mid-level data fusion e and mid-level data fusion v. The lower accuracy (94.12%) of low-level data fusion may be due to redundant information introduced by direct data splicing (Borràs et al., 2015). Importantly, the model of mid-level data fusion q had the highest R²Y(cum) (0.91) and Q²(cum) (0.86) and the lowest RMSEE (0.13) and RMSECV (0.19), which means that the model could fit new data very well and had good potential to predict sample origin; indeed, this model classified every sample into its real species, corresponding to a total accuracy of 100%.
The accuracies of all classification models rank as follows: mid-level data fusion q > mid-level data fusion e = mid-level data fusion v = stipes > low-level data fusion > caps, which indicates that mid-level data fusion q combined with the PLS-DA model can be seen as a reliable way to discriminate mushroom species.

| Discriminating samples by SVM model
The supervised models GS-SVM and PSO-SVM were developed for comparison with the PLS-DA models in the search for a fast and simple way to discriminate mushroom species. Table 3 shows both the model parameters and the total correct rate. As shown in Table 3, with caps data the PSO-SVM model had a higher test-set accuracy than GS-SVM (97.06% versus 94.12%), and the best penalty parameter (c) and kernel parameter (g) were 30.44 and 20.93, respectively. Correspondingly, PSO-SVM misclassified fewer samples than GS-SVM (one versus two). In the stipes models, however, discrimination accuracy was lower for PSO-SVM (88.24%) than for GS-SVM (91.18%). The highest accuracy was obtained with the low-level data fusion strategy, with a 100% correct rate and no misclassified samples in both GS-SVM and PSO-SVM. As the best discrimination models, low-level data fusion combined with GS-SVM or PSO-SVM could be used as a fast and simple approach to discriminate mushroom species.
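The reported correct rates can be reproduced from misclassification counts with a short Python sketch. It assumes a test set of 34 samples, a size that is inferred here from the reported percentages and the one-versus-two misclassified samples rather than stated in the text; the function name is hypothetical.

```python
def correct_rate_percent(n_test, n_misclassified):
    """Percentage of test samples assigned to their true class, to two decimals."""
    return round(100.0 * (n_test - n_misclassified) / n_test, 2)


N_TEST = 34  # assumed test-set size, inferred from the reported percentages

rates = {errors: correct_rate_percent(N_TEST, errors) for errors in range(5)}
```

Under this assumption, zero to four misclassified samples give 100.0, 97.06, 94.12, 91.18, and 88.24%, which are exactly the distinct correct rates quoted for the PLS-DA and SVM models.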

| Total polyphenol content
In this study, 40% ethanol was used to extract total polyphenol from the mushrooms. The arithmetic mean, standard deviation, median, content range, extraction percentage, and bioconcentration factor of phenolic compounds in the two morphological parts of the mushrooms are presented in Table 4.
The median total phenolic content of the four species analyzed varied between 10.13 and 14.76 mg/g dw in caps and between 10.70 and 17.33 mg/g dw in stipes. L. rugosiceps had a greater total polyphenol content than B. tomentipes, with median values in caps and stipes of 14.6 mg/g dw and 13.68 mg/g dw, respectively, compared with 12.03 mg/g dw and 10.09 mg/g dw, respectively.
The smallest contents were found in B. umbriniporus, with median values of 10.13 mg/g dw in caps and 10.70 mg/g dw in stipes. Among these results, B. edulis mushrooms showed the greatest total phenolic content in both caps (14.76 mg/g dw) and stipes (17.33 mg/g dw). The content range of caps was from 12.28 to 17.55 mg/g dw, while that of stipes was from 8.16 to 19.71 mg/g dw. The higher total phenolic content may be a major contributor to the better flavor of B. edulis mushrooms, which are more popular with customers.
consisted of pileipellis, flesh, and hymenium, which are important for its metabolism and reproduction (Casale et al., 2012). Therefore, the cap may be the major part in which active constituents are enriched and can be the primary choice for dietary nutrition and natural antioxidants.
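The distribution of polyphenols between the two parts can be expressed as a cap-to-stipe ratio, as in the Python sketch below. It assumes that the ratio is the cap median divided by the stipe median and uses the medians quoted in the text for three of the four species; the dictionary layout and function name are hypothetical.

```python
# Median total polyphenol contents (mg/g dw) quoted in the text: (cap, stipe).
medians = {
    "B. edulis": (14.76, 17.33),
    "B. umbriniporus": (10.13, 10.70),
    "B. tomentipes": (12.03, 10.09),
}


def cap_to_stipe_ratio(cap, stipe):
    """Ratio above 1 means more polyphenol in the cap than in the stipe."""
    return round(cap / stipe, 2)


ratios = {species: cap_to_stipe_ratio(cap, stipe)
          for species, (cap, stipe) in medians.items()}
richer_in_cap = [species for species, ratio in ratios.items() if ratio > 1]
```

Under this assumption the ratios are 0.85, 0.95, and 1.19, matching the values given in the abstract for these species, with only B. tomentipes of the three showing enrichment in the cap.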

| Predicting content of total polyphenol
A GS-SVM regression model was developed using raw spectral information from the two morphological parts of the mushrooms, combined with various data treatments, to predict the amount of total polyphenol in the four porcini mushrooms. Table 5 shows the results, and the best prediction results for the two parts are shown in Figure 4. With caps data, the RPD values of the models based on the four data treatments FD, FD + SG (7), FD + SG (9), and SNV + FD + SG (9) were 1.52, 1.51, 1.50, and 1.86, respectively, which indicates that these models could be used to distinguish samples with a high concentration of polyphenol.
The best prediction model was obtained with the SD treatment; its RMSEP, RPD, and R² were 0.82, 2.4, and 86.76%, respectively, which indicates a good relationship between polyphenol content and FT-MIR data, and this model could be applied to predict total polyphenol content roughly.
In the models based on stipes data, the best R² (84.66%) was obtained with the FD treatment, which indicates a good relationship between polyphenol content and FT-MIR data. However, the best prediction was achieved by the model with the MSC + FD + SG (9) treatment, with R², RMSEP, and RPD of 0.83, 2.32, and 1.78, respectively, which indicates that the model could discriminate high values from low values among these amounts. In summary, prediction with caps data was superior to prediction with stipes data. Although numerous data treatments were used to improve data quality, this study did not obtain an excellent model for accurately predicting total polyphenol content. This may be caused by an insufficient sample size, and a better model may be obtained in the future by increasing the sample size and applying data fusion strategies.

| CONCLUSION
The test results showed that this study provides a fast and accurate method for species discrimination. In addition to low-level data fusion, three types of feature variables were selected to complete mid-level data fusion. The PLS-DA model based on feature variables selected by maximum Q²(cum), and the GS-SVM and PSO-SVM models based on low-level data fusion, had 100% discrimination accuracy, with each mushroom classified into its real species. Moreover, this study measured the alcohol-soluble polyphenol content of four porcini mushrooms, reported for the first time the accumulation tendency of polyphenols in different parts of the mushroom, and predicted the polyphenol content of the four porcini mushrooms. The results suggest that caps data combined with the second-order derivative can be used to predict the polyphenol content of the four porcini mushrooms roughly. The outcomes of this work can provide academic references for origin traceability, market supervision, quality evaluation, and food safety.