Detection of fraud in lime juice using pattern recognition techniques and FT‐IR spectroscopy

Abstract Lime juice is a product that is frequently adulterated by manufacturers seeking to reduce production costs. The aim of this research was to detect fraud in lime juice products distributed by different factories in Iran. In this study, 101 samples were collected from markets or prepared manually and divided into five classes: two natural classes (Citrus limetta, Citrus aurantifolia), comprising 17 samples, and three reconstructed classes, comprising 84 samples (made from Spanish concentrate, Chinese concentrate, and concentrate containing adulteration compounds). The lime juice samples were freeze‐dried and analyzed using FT‐IR spectroscopy. First, principal component analysis (PCA) was applied for clustering, but the samples did not cluster cleanly according to their original groups in the score plots. To improve the classification rates, different chemometric algorithms, including variable importance in projection (VIP), partial least squares‐discriminant analysis (PLS‐DA), and counter propagation artificial neural networks (CPANN), were used. The most discriminatory wavenumbers for each class were selected using the VIP‐PLS‐DA algorithm. Then, the CPANN algorithm was used as a nonlinear mapping tool to classify the samples according to their original groups. The lime juice samples were correctly assigned to their original groups in the CPANN maps, and the overall accuracy of the model reached 0.96 and 0.87 for the training and validation procedures, respectively. This level of accuracy indicates that FT‐IR spectroscopy coupled with the VIP‐PLS‐DA and CPANN methods can be used successfully to assess the authenticity of lime juice samples.

manufacturers add the adulterant compounds into reconstructed lime juices. Nowadays, with increasing public awareness, beverages are consumed not only for their taste, freshness, and thirst-quenching properties but also for their health benefits and antioxidant compounds such as phenolic compounds and carotenoids (Zielinski et al., 2014). Foods and beverages have been intentionally or unintentionally contaminated with chemical compounds; even when the added substances are not harmful to human health, they are considered fraudulent (Everstine et al., 2013), and these compounds may cause allergic effects. The beverage industry has grown dramatically, and fraud in juices has increased along with it (Kelly & Downey, 2005). Fruit juice is one of the industrial products most commonly subjected to alteration and fraud (Abad-García et al., 2014).
Fruit juice adulteration certainly has negative effects on juice quality. Moreover, with the global increase in adulteration, the impact of a single food adulteration event will affect a larger and wider population than ever. As contamination of lime juice with different compositions is an ongoing problem, appropriate analytical techniques are needed to detect frauds (Zhang et al., 2009).
Presently, there is high demand for fast methods combined with chemometrics for the determination of fraud (Callao & Ruisánchez, 2018). Profiling methods such as Fourier transform infrared (FT-IR), near infrared (NIR), and Raman spectroscopy reflect the overall composition of a sample rather than a single indicator component. FT-IR spectroscopy has developed significantly as a means of determining biochemical patterns that can be processed by statistical classification methods such as principal component analysis (PCA), linear discriminant analysis (LDA), artificial neural networks (ANNs), and support vector machines (SVM) (Callao & Ruisánchez, 2018).
IR spectroscopy measures the vibrations of molecules; each functional group or structural feature of a molecule has a characteristic vibrational frequency that indicates which functional groups are present in a sample. The characteristic spectra of samples provide a molecular "fingerprint" that can be used for classification purposes. Moreover, this technique requires relatively little sample preparation, the whole analysis is rapid and cheap, and it can be easily employed in fundamental research and in quality control processes. In contrast, high-performance liquid chromatography and gas chromatography are costly (solvent consumption) and time-consuming (Rodriguez-Saona & Allendorf, 2011).
Multivariate data from FT-IR require adequate statistical techniques to draw inferences about differences among samples from fluctuations in the signal. Along these lines, chemometric methods have been applied to evaluate the naturalness and geographical origin of foods, such as edible oils (Noorali et al., 2014), saffron (Petrakis et al., 2015), meat products (Pieszczek et al., 2018), wine (Fan et al., 2018), and juice (Ghaderi-Ghahfarokhi et al., 2016).
FT-IR spectra contain a large amount of information, and every single wavenumber carries specific information about molecular structures. However, in many cases, a single wavenumber cannot solve pattern recognition problems. Unlike univariate approaches, multivariate methods are able to tackle such problems. One of the most important issues in multivariate analysis is the large number of independent variables, usually obtained by different instruments. Variable selection can increase the interpretability and accuracy of a model through a more parsimonious representation (Farrés et al., 2015). One of the most efficient variable selection methods is variable importance in projection-partial least squares (VIP-PLS), which was introduced by Lennart et al. (2006). The PLS method models the correlation between the predictor variables (X) and the response variables (Y), and the strength of the relationship between each predictor variable and the response is indicated by the VIP index.
The VIP values measure the importance of each variable used by a particular PLS model via their coefficients in every component (Rajalahti et al., 2009), and variables with VIP ≥ 1 are considered the most relevant to classify samples into different classes (Rajalahti et al., 2009).
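As an illustration of how VIP values follow from a fitted PLS model, the sketch below fits a single-response PLS model with the NIPALS algorithm and computes VIP scores from its weights, scores, and y-loadings. It is a minimal sketch under stated assumptions, not the MATLAB implementation used in this work: the function names (`pls1_nipals`, `vip_scores`), the synthetic data, and the use of a single 0/1 response (one class against the rest) are all hypothetical choices made for the example.

```python
import numpy as np

def pls1_nipals(X, y, n_components):
    """Single-response PLS (NIPALS). Returns weights W, scores T, y-loadings q."""
    X = X - X.mean(axis=0)            # mean-center the predictors
    y = y - y.mean()                  # mean-center the response
    n, p = X.shape
    W = np.zeros((p, n_components))
    T = np.zeros((n, n_components))
    q = np.zeros(n_components)
    for a in range(n_components):
        w = X.T @ y
        w /= np.linalg.norm(w)        # unit-length weight vector
        t = X @ w                     # scores of component a
        tt = t @ t
        p_load = X.T @ t / tt         # X-loadings, used only for deflation
        q[a] = y @ t / tt             # y-loading of component a
        X = X - np.outer(t, p_load)   # deflate X
        y = y - q[a] * t              # deflate y
        W[:, a], T[:, a] = w, t
    return W, T, q

def vip_scores(W, T, q):
    """VIP_j = sqrt(p * sum_a(SS_a * w_ja^2 / ||w_a||^2) / sum_a(SS_a))."""
    p = W.shape[0]
    ss = q ** 2 * np.sum(T ** 2, axis=0)        # y-variance explained per component
    w2 = (W / np.linalg.norm(W, axis=0)) ** 2   # squared normalized weights
    return np.sqrt(p * (w2 @ ss) / ss.sum())

# Synthetic stand-in data (assumption): only variable 0 carries class information.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 6))
y = (X[:, 0] + 0.1 * rng.normal(size=40) > 0).astype(float)

W, T, q = pls1_nipals(X, y, n_components=2)
vip = vip_scores(W, T, q)
selected = np.flatnonzero(vip > 1)   # indices of variables retained by the VIP rule
print(np.round(vip, 2), selected)
```

Because the squared VIP values average to 1 over all variables, a threshold of 1 separates variables that contribute more than average from those that contribute less, which is why it is a common cutoff.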
ANN is among the most useful and widely applied chemometric methods and can be used for clustering, classification, modeling of a property, noise filtering, prediction, correction of defects, etc. (Moldes et al., 2017). ANNs offer several advantages: flexibility, compatibility, the ability to handle nonlinear relationships, and applicability to complex variables and problems (Astray et al., 2013).
Counter propagation artificial neural network (CPANN) is one of the most commonly used versions of artificial neural networks that are applied for supervised classification purposes. Neurons of this network mimic the action of a biological neuron in response to an external stimulus (Ballabio et al., 2009).
The CPANN method is able to solve nonlinear mapping problems that PCA and LDA cannot solve (Melssen et al., 2006). The CPANN architecture has two layers of neurons, a Kohonen layer and an output layer (Grossberg layer), and maps the multidimensional input data onto a lower dimensional (two-dimensional) array (Neiband et al., 2017). The Kohonen layer is composed of a grid of N × N neurons (where N is the number of neurons on each side of the grid). The CPANN training process is similar to that of self-organizing maps (SOM). The neuron most similar to the input vector is called the "winner neuron." According to the location of the winner neuron in the Kohonen layer, the weights of the network are iteratively updated on the basis of the input object. During training, all samples are placed on the map according to their assigned classes, and each neuron has as many weights as there are variables in the original data. More details about the theory of the CPANN algorithm can be found in the literature (Neiband et al., 2017).
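The training procedure described above can be sketched in a few lines. This is a simplified illustration, not the toolbox implementation used in this work: the function names (`train_cpann`, `predict_cpann`), the linearly decreasing learning rate, the triangular neighborhood function, the small 5 × 5 map, and the synthetic two-class data are all assumptions made for the example.

```python
import numpy as np

def train_cpann(X, labels, n_classes, side=5, epochs=50, seed=0):
    """Train a small counter propagation network on a side x side Kohonen grid."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    kohonen = rng.random((side, side, p))                       # one weight per variable
    output = np.full((side, side, n_classes), 1.0 / n_classes)  # output (Grossberg) layer
    Y = np.eye(n_classes)[labels]                               # one-hot class vectors
    rows, cols = np.indices((side, side))
    total, step = epochs * n, 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            frac = step / total
            lr = 0.5 * (1 - frac) + 0.01 * frac          # learning rate decays 0.5 -> 0.01
            radius = max(side / 2 * (1 - frac), 0.5)     # neighborhood shrinks over time
            # Winner neuron: smallest squared Euclidean distance to the input vector.
            d = np.sum((kohonen - X[i]) ** 2, axis=2)
            r, c = np.unravel_index(np.argmin(d), d.shape)
            # Triangular neighborhood around the winner (Chebyshev distance on the grid).
            grid_dist = np.maximum(np.abs(rows - r), np.abs(cols - c))
            h = np.where(grid_dist <= radius, lr * (1 - grid_dist / (radius + 1)), 0.0)
            kohonen += h[..., None] * (X[i] - kohonen)   # pull Kohonen weights toward input
            output += h[..., None] * (Y[i] - output)     # pull output weights toward class
            step += 1
    return kohonen, output

def predict_cpann(kohonen, output, X):
    """Assign each sample the class with the largest output weight at its winner neuron."""
    preds = []
    for x in X:
        d = np.sum((kohonen - x) ** 2, axis=2)
        r, c = np.unravel_index(np.argmin(d), d.shape)
        preds.append(int(np.argmax(output[r, c])))
    return np.array(preds)

# Synthetic two-class data scaled to [0, 1] (assumption, not the FT-IR data).
rng = np.random.default_rng(1)
X = np.vstack([0.1 + 0.03 * rng.normal(size=(20, 2)),
               0.9 + 0.03 * rng.normal(size=(20, 2))])
labels = np.array([0] * 20 + [1] * 20)

kohonen, output = train_cpann(X, labels, n_classes=2)
pred = predict_cpann(kohonen, output, X)
print("training accuracy:", np.mean(pred == labels))
```

The only difference from a plain SOM is the second weight array: the output layer is updated with the same winner and neighborhood as the Kohonen layer, but toward the class vector instead of the input vector.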
The purpose of this work was to determine lime juice fraud based on FT-IR spectroscopy and different pattern recognition methods.
A very diverse set of natural, synthetic, and adulterated lime juice samples was prepared and used to develop the discriminative models. Since the number of wavenumbers in the collected FT-IR data was large, the VIP-PLS-DA algorithm was used as a supervised dimension reduction technique. This method significantly filters the variables and removes redundant wavenumbers from the original data matrix. The reduced variable space selected using the VIP-PLS-DA algorithm was used as input for CPANN in order to develop nonlinear classification models. In effect, the present contribution searched for a list of functional groups and transmittance values that could reasonably discriminate between different classes of original and artificial lime juice samples.

| Chemicals
Ascorbic acid was purchased from Rankem, Co (Okhla). Citric acid, potassium bromide (KBr), and sodium metabisulfite were purchased from Sigma Chemical Co. Brilliant Green was purchased from Hangzhou Mike Chemical Instrument Co. Ltd.

| Sample reconstruction and collection
In this study, 101 lime juices were used for modeling, which can be divided into five general categories as follows: class 1, 9 sam-
The unadulterated lime juices (samples of classes 3 and 4) were formulated based on the Iranian national standards of lime juice (INSO, 1996). Therefore, some of the indicators, including color, pH, and °Brix, were adjusted by Hunter-Lab scan XE-Spectrocolorimeter  Table 1.
Finally, after development of CPANN classification models, 20 brands of lime juices (collected from local markets) were evaluated and predicted based on the presented model.

| Preparation of samples for FT-IR spectroscopy
Two milliliters of each sample was dried with a Christ Alpha 1-2 LD Plus (Germany) freeze dryer at −55°C and a pressure of 1 mbar for 36 hr.
Then, 1 mg of the dried lime juice was blended with potassium bromide (KBr) powder and compressed into tablets. The FT-IR spectra were obtained with a Nicolet IR 100 (Thermo Scientific). All spectra were ratioed against a background air spectrum and recorded as transmittance values at each data point. Each sample was measured in 5 replicates, and the spectral resolution was set at 8 cm⁻¹.

| Multivariate statistical analysis
The spectral information was saved as .xlx files by the Encompass program to be analyzed with MATLAB software, version 8.3.0.532. The IR transmittances were recorded at 468 wavenumbers, and the size of the collected data matrix was 505 × 468 (101 samples × 5 replicates). The collected FT-IR data were mean centered and then autoscaled according to the standard deviation of the wavenumbers. For this standardization, the FT-IR information ($D_1$) was transformed into the $D_2$ value according to the following equation (Neiband et al., 2017):

$$D_2 = \frac{D_1 - D_m}{\delta_s}$$

where $D_m$ and $\delta_s$ are the mean and the standard deviation of the FT-IR spectra for each wavenumber, respectively.
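The standardization described above (subtracting each wavenumber's mean and dividing by its standard deviation) can be sketched as follows. The function name `autoscale`, the random stand-in matrix, and the use of the sample standard deviation (`ddof=1`) are assumptions; the text does not state which standard deviation estimator was used.

```python
import numpy as np

def autoscale(D1):
    """Column-wise autoscaling: center each wavenumber and scale it to unit variance."""
    D1 = np.asarray(D1, dtype=float)
    Dm = D1.mean(axis=0)             # mean of each wavenumber (column)
    ds = D1.std(axis=0, ddof=1)      # sample standard deviation (assumed estimator)
    return (D1 - Dm) / ds

# Random stand-in with the same shape as the data matrix in this work (505 x 468).
rng = np.random.default_rng(0)
D1 = rng.uniform(20, 90, size=(505, 468))
D2 = autoscale(D1)
print(D2.shape)
```

After this step every wavenumber has zero mean and unit standard deviation, so intense and weak bands contribute on the same scale to the subsequent models.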
In this study, the PCA method was first applied for clustering of the lime juice samples. The PCA results give a preliminary view of the collected data and guide the selection of an appropriate classification algorithm for discriminating the samples.
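For readers who want to reproduce this kind of preliminary view, the sketch below computes PCA scores from the singular value decomposition of the column-centered data. The function name `pca_scores` and the two synthetic groups are assumptions for illustration only.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """PCA via SVD of the column-centered data.

    Returns the scores, the loadings, and the fraction of variance explained
    by each retained component.
    """
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]   # sample coordinates on the PCs
    loadings = Vt[:n_components].T                    # variable contributions to the PCs
    explained = (s ** 2 / np.sum(s ** 2))[:n_components]
    return scores, loadings, explained

# Two synthetic groups shifted relative to each other (assumption).
rng = np.random.default_rng(0)
group_a = rng.normal(0.0, 1.0, size=(15, 10))
group_b = rng.normal(0.0, 1.0, size=(15, 10)) + 3.0
X = np.vstack([group_a, group_b])

scores, loadings, explained = pca_scores(X, n_components=2)
print(np.round(explained, 3))
```

Plotting the first score column against the second gives the score plot; when groups overlap in that plot, as reported here for the lime juice classes, a supervised method is the natural next step.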
The most discriminatory variables in this work were selected using the VIP-PLS-DA algorithm, and the samples were subsequently classified according to the variables with VIP > 1. These variables were used as inputs for the development of CPANN models. The number of neurons (NN) and training epochs (NE) were optimized using the classification toolbox in MATLAB. The optimized NN and NE were 20 and 200, respectively. Finally, the performance of the CPANN model was compared with that of PLS-DA, and the models were evaluated using an external test set and a Y-randomization test.
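The idea behind the Y-randomization test can be illustrated with a short sketch: the class labels are shuffled, the model is refitted, and the resulting accuracies are compared with the accuracy obtained with the true labels. To keep the example short, a nearest-centroid classifier stands in for the CPANN model; this substitution, the function names, and the synthetic data are assumptions, not the procedure used in this work.

```python
import numpy as np

def fit_predict_centroid(X, labels):
    """Stand-in classifier (assumption): assign each sample to the nearest class centroid."""
    classes = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in classes])
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[np.argmin(d, axis=1)]

def y_randomization(X, labels, n_permutations=50, seed=0):
    """Compare accuracy with the true labels against accuracies with shuffled labels."""
    rng = np.random.default_rng(seed)
    true_acc = np.mean(fit_predict_centroid(X, labels) == labels)
    perm_acc = []
    for _ in range(n_permutations):
        shuffled = rng.permutation(labels)   # break the link between X and the classes
        perm_acc.append(np.mean(fit_predict_centroid(X, shuffled) == shuffled))
    return true_acc, np.array(perm_acc)

# Synthetic two-class data (assumption).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(30, 3)),
               rng.normal(4.0, 1.0, size=(30, 3))])
labels = np.array([0] * 30 + [1] * 30)

true_acc, perm_acc = y_randomization(X, labels)
print(true_acc, perm_acc.mean())
```

If the accuracy with the true labels is clearly higher than the accuracies obtained with shuffled labels, the classification is unlikely to be the result of chance correlation.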

| Spectral exploration
The FT-IR spectra of different samples are shown in Figure 1a. As can be seen in this figure, all spectra are rather complex and multivariate statistical methods are needed to explore the differences between the samples.
The intense band in the region of 2500-3500 cm⁻¹ shows the existence of free and intermolecularly bonded hydroxyl groups of carboxylic acids. The functional groups -COOH and -COOCH₃ are indicated at 1727 cm⁻¹. The peak at 1257 cm⁻¹ is attributed to the C-O groups of aliphatic acids. The bands of the fundamental molecular vibrations of C-C, C-O, and C-C-O of sugars are assigned at 1050 cm⁻¹.
[Table caption: The compositions of different reconstructed lime juice samples in this work]
As can be seen in Table 2, the accuracy of the model is reasonable (96%), and the other classification parameters (i.e., specificity, sensitivity, and precision) revealed the high power of the model for detecting fraud in lime juice samples. It is worth mentioning that the sensitivity and precision rates for class 4 are lower than those of the other classes, which was probably due to the presence of additives such as metabisulfite, ascorbic acid, and citric acid in the Chinese concentrate.

| Counter propagation artificial neural networks coupled with VIP-PLS-DA analysis
The confusion matrices of the CPANN model for training and validation procedures are given in Table 3. As can be seen in Table 3, while almost all the samples were classified in their own groups, some samples of class 4 were categorized in class 5, which is due to the similarity between these classes. Furthermore, some samples of class 1 were classified in class 5, which is probably due to the specific composition of some samples in class 5, which contained lime waste and concentrates.
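The classification parameters derived from such confusion matrices (sensitivity, specificity, and precision per class, plus overall accuracy) can be computed as in the sketch below. The function name `class_metrics` and the 3-class matrix are hypothetical and chosen only to show the arithmetic; they are not the values in Table 3.

```python
def class_metrics(cm):
    """Per-class metrics from a confusion matrix.

    cm[i][j] is the number of samples of true class i+1 predicted as class j+1.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    results = {}
    for k in range(n):
        tp = cm[k][k]                                   # correctly assigned to class k
        fn = sum(cm[k]) - tp                            # class k assigned elsewhere
        fp = sum(cm[i][k] for i in range(n)) - tp       # other classes assigned to k
        tn = total - tp - fn - fp
        results[k + 1] = {
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
            "precision": tp / (tp + fp),
        }
    accuracy = sum(cm[k][k] for k in range(n)) / total
    return results, accuracy

# Hypothetical 3-class confusion matrix (rows: true class, columns: predicted class).
cm = [[9, 0, 1],
      [0, 8, 2],
      [1, 1, 18]]
per_class, accuracy = class_metrics(cm)
print(accuracy)
for k, m in per_class.items():
    print(k, {name: round(v, 3) for name, v in m.items()})
```

Reading the matrix this way makes it clear why confusion between two similar classes lowers the sensitivity of one class and the precision of the other at the same time.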
The prediction ability of the developed CPANN model was further evaluated using receiver operating characteristic (ROC) curves.
The area under the ROC curve (AUC) can be used as a measure of the discriminatory power of the classification models.
The ROC curves of the CPANN model for each group of samples are illustrated in Figure 4. The prediction ability was acceptable for all classes, although it was lower for class 4 than for the others. This is in agreement with the results shown in Tables 2 and 3 for this class of samples.
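For a single class treated as "positive" against all other classes, the AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below uses that pairwise definition; the function name `roc_auc` and the four example scores are assumptions for illustration, not values from this work.

```python
def roc_auc(scores, is_positive):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for s, y in zip(scores, is_positive) if y]
    neg = [s for s, y in zip(scores, is_positive) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical class-membership scores for four samples (one class versus the rest).
scores = [0.1, 0.4, 0.35, 0.8]
is_positive = [False, False, True, True]
auc = roc_auc(scores, is_positive)
print(auc)  # 0.75: three of the four positive/negative pairs are ranked correctly
```

Repeating the calculation with each class in turn as the positive class gives one AUC per class, which is how a per-class comparison such as the one for class 4 can be made.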

| Comparison with conventional linear classifier and external validation
In order to compare the CPANN model with other methods, the PLS-DA analysis was also conducted in this research for classifica-

| DISCUSSION
This research showed that FT-IR data coupled with the VIP and CPANN models could be an effective method for fraud detection, whereas previous researchers looked for a specific compound, or a ratio between compounds, as an indicator of fraud.
The European Fruit Juice Association measured some indicators, such as Brix, relative density, citric acid, D-isocitric acid, ash, flavonoids (hesperidin and eriocitrin), glucose, and fructose, for the detection of fraud in lime juice. However, these indicators are not sufficient to determine fraud (AIJN, 2015). Lorente et al. (2014) reported that these indicators were not appropriate for fraud detection in Spanish lemon juice because some of the parameters fell below or exceeded the range determined by the European Fruit Juice Association. However, they confirmed that some of the measured parameters, such as D-isocitric acid, citric acid, and potassium ion concentration, could differentiate between natural and reconstructed groups of lime juices.
A list of some analytical methods for the determination of fraud in lime juice samples is given in Table 6. In this regard, Asemi isotope ratio (δ¹³C/¹²C) of citric acids, glucose, and fructose in lime juices. The results of their study showed that lemon juice fraud can be determined using these ratios with a confidence level of 95%.
However, Haminiuk et al. (2012) and Wang and Jablonski (2016) indicated that phenolic and flavonoid compounds could not be good indicators for fraud detection because growth conditions, rainfall, dewatering methods, and other factors can affect the amounts of these compounds. Therefore, in addition to being expensive and difficult, these methods may not be able to detect fraud in very diverse sets of samples.
Another effective way to identify fraud and authenticity is to use inexpensive techniques (FT-IR and NIR) in conjunction with chemometrics, which are considered as effective methods for investigating a list of functional groups. In this regard, Shafiee and Minaei (2018), using NIR and chemometrics, investigated the originality of lime juices. The results of their work showed that PCA as an unsupervised method is not always suitable, while supervised methods such as SVM after variable selection were effectively able to classify and distinguish classes (more details are given in Table 6). However, this study is not comprehensive enough to determine authenticity due to the limited number of the samples.
Another method used to detect fraud was inductively coupled plasma-mass spectrometry (ICP-MS) for the determination of twenty-five elements. The results indicated that this method was able to attribute unknown lemon juice samples to their geographical origins.
SVM had a better performance in this type of classification compared to RF (random forests), k-NN (k-Nearest Neighbor), LDA, and PLS-DA (Gaiad et al., 2016).
In addition, the use of spectroscopic techniques coupled with chemometrics plays a major role in detecting the authenticity of juices. Hohmann
FT-IR spectroscopy as a screening technique, in contrast with HPLC and HPLC-CO-MS, is inexpensive, fast, easy, and compatible with chemometrics, and it can provide information on functional groups to classify samples. In this study, the collected FT-IR spectra were mathematically explored by the VIP method. In this process, the information-rich wavenumbers of the FT-IR spectra that effectively created differences between the groups were selected. Based on the calculated VIP values, the important variables were separated from the whole data and a CPANN model was built using these wavenumbers as inputs. (Kiralj & Ferreira, 2009; Tropsha & Golbraikh, 2007).
In addition, another advantage of this study was the variety and high diversity of the samples. With very diverse sets of samples, the classification model will be more accurate and reliable. identifying unauthorized additives (such as lime waste, citric acid, ascorbic acid, metabisulfite, salt, and sugar) using an inexpensive and precise method. Therefore, this work was not limited to natural lime juices and synthetic samples, as it also dealt with a number of dangerous frauds (more details are given in Table 6).
Finally, in order to check the market samples, 20 brands were evaluated based on their FT-IR spectra and the trained CPANN model. Interestingly, all samples were thoroughly categorized in class 5 (accuracy = 100% for validation set), which could be a warning to the food industry and consumers. The presence of unauthorized components (metabisulfite) in lime juice will cause harm to consumers (such as allergic reactions and dyspnea) and will also reduce the credibility of the food industry.

| CONCLUSIONS
In this work, the authenticity of commercial lime juice was evaluated using FT-IR spectroscopy coupled with VIP variable selection and CPANN models. The main advantage of the present contribution is the diversity of the calibration samples, which include broad ranges of natural, synthetic, and adulterated lime juice samples. Therefore, the applicability domain of the developed discriminative model is broad, which is a needed property for fraud detection in the lime juice industry. The calculated VIP values indicated that the molecular functional groups with IR signals in the ranges of 741-756, 1697-1743, 1859-1944, 2013-2198, 2916-2939, and 3950-4003 cm⁻¹ are important for detecting fraud in lime juice samples. The classification results showed that FT-IR spectroscopy in conjunction with the CPANN model has great potential as an alternative method for screening purposes and detection of fraud in the lime juice industry. Moreover, the developed model was validated using Venetian blind cross-validation and Y-randomization tests. The CPANN model was also compared with the conventional PLS-DA algorithm. The results indicated that the construction of the model was successful and that the good classification results were not due to chance correlation. Therefore, the developed CPANN model can be of interest to food quality control institutions and food safety organizations seeking to reduce lime juice fraud.

ACKNOWLEDGMENTS
The authors would like to thank Tarbiat Modares Research Council for its financial support.

ETHICAL APPROVAL
On behalf of all coauthors, I, Dr. Mohsen Barzegar, declare that this article has not been published and is not under consideration for publication elsewhere. All authors were actively involved in the work leading to the manuscript and will hold themselves jointly and individually responsible for its content. Also, there is no conflict of interest in this paper.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available from the corresponding author upon reasonable request.