Functional approach and agro‐climatic information to improve the estimation of olive oil fatty acid content from near‐infrared data

Abstract Extra virgin olive oil (EVOO) is highly appreciated for its taste, flavor, and health benefits, and it therefore commands a high market price. This makes it necessary to provide reliable and cost‐effective analytical procedures, such as near‐infrared (NIR) spectroscopy in combination with chemometrics, to analyze its traceability and purity. The fatty acid profile of EVOO, considered as a quality parameter, is estimated, firstly, from NIR data and, secondly, by adding agro‐climatic information. NIR and agro‐climatic data sets are summarized by using principal component analysis (PCA) and treated by both scalar and functional approaches. The corresponding PCA and functional PCA (FPCA) components are progressively introduced in regression models, whose goodness of fit is evaluated by the dimensionless root‐mean‐square error. In general, the estimations of saturated (SFAs), monounsaturated (MUFAs), and polyunsaturated (PUFAs) fatty acids, as well as of the disaggregated fatty acids, are improved by adding agro‐climatic information (mainly, temperature or evapotranspiration) to the NIR information and by considering a functional point of view for both NIR and agro‐climatic data.

Raman (FT-Raman) spectroscopy-generate continuous information too, but derivatization is not necessary, and they are reliable, rapid, and cost-effective. IR and Raman spectroscopies can be considered complementary techniques in the identification of unknown substances in chemical samples. However, an advantage of IR over Raman is the cost, because Raman spectroscopy needs high-powered lasers and amplification sources to obtain sensitive results, and the intense laser radiation can even destroy the sample. Besides, IR spectroscopy has been a well-understood, established technique for much longer, so it provides greater sensitivity and reliability than Raman techniques. Moreover, in the comparison of NIR vs. MIR, NIR allows more flexible sampling arrangements and cheaper, more rugged instrumentation than MIR. The utility of NIR is therefore clear. The application of multivariate statistics to NIR spectra makes it possible to obtain qualitative or quantitative information about EVOO (Berrueta, Alonso-Salces, & Héberger, 2007), which is useful to avoid fraudulent practices in the oil sector. There are many studies in the literature on the application of chemometrics to EVOO NIR spectra, especially with the main aim of authenticating the oil and evaluating its quality parameters. These works show that NIR spectra contain useful and valuable information about EVOO. For instance, NIR spectra have been used for the determination of geographical origins, Protected Designations of Origin (PDO), or compositions (mainly, the fatty acid profile) (Bertran et al., 2000; Casale et al., 2012; Galtier et al., 2007; Mailer, 2004; Sánchez-Rodríguez et al., 2013, 2014; Woodcock, Downey, & O'Donnell, 2008).
Moreover, there are many works analyzing the influence of weather, agro-climatic, or meteorological conditions on food content, in general, such as berries (Yang, Laaksonen, Kallio, & Yang, 2017), castor beans (Falasca, Ulberich, & Ulberich, 2012), currants (Zheng et al., 2012), grapes (Luciano, Albuquerque, Rufato, Miquelluti, & Warmling, 2013), mangos (Rymbai et al., 2014), sweet potatoes (Edmunds, Clark, Villordon, & Holmes, 2015), pineapples (Dorey, Fournier, Léchaudel, & Tixier, 2016), or wheat (Khokhar et al., 2017). In particular, many papers treat the effect of these agro-climatic conditions on olive oils (Awan, 2014; Ozdemir, 2016; Veizi, Peçi, & Lazaj, 2016; Zaied & Zouabi, 2016). But there are few works considering NIR data to study this agro-climatic influence on oils or other food products. In relation to the multivariate statistical technique applied, all the previous studies consider a non-numerical variable (i.e., a factor) to differentiate among agro-climatic or meteorological groups. This factor can subsequently be used, for example, as an independent variable in an analysis of variance model or as a dependent variable in linear discriminant analysis. In contrast, this paper uses the complete agro-climatic database obtained from the official webpage of the Automatic Weather Stations (AWSs) of Andalusia, instead of clustering the information in groups. More specifically, the historical daily information has been downloaded, from 2005 to 2010, for the following variables: temperature, humidity, wind speed, radiation, precipitation, and evapotranspiration.
Furthermore, functional data analysis (FDA) is a relatively recent statistical methodology concerned with the analysis of any data set that can be thought of as a function or a curve (i.e., an infinite-dimensional variable). FDA was initially popularized by Ramsay and Silverman (2007), and it is currently one of the most active fields of research in data science, in general (Aneiros, Cao, Fraiman, Genest, & Vieu, 2019). In particular, the potential of FDA to characterize, compare, and classify chemical data has been analyzed by Burfield, Neumann, and Saunders (2015). But, although FDA has been applied to some examples of NIR data (Aguilera, Escabias, Valderrama, & Aguilera-Morillo, 2013; Saeys, De Ketelaere, & Darius, 2008), in no case have olive oil data been treated by using this approach.
The aim of this work was to determine the fatty acid profile of EVOO from NIR spectral data, in a first step, and to analyze whether the goodness of fit of the estimation can be improved by also considering agro-climatic information. In this work, PCA and FPCA components are progressively introduced in the models. The reliability of these regression models is compared by using the dimensionless root-mean-square error (DRMSE), taking into account the scalar or functional approach to the data and the number of retained components (considering the recommendations in the literature with respect to the optimal number of components to avoid overfitting (Hawkins, 2004)). Finally, estimations for some disaggregated fatty acids (in particular, palmitic, stearic, palmitoleic, oleic, linoleic, and linolenic) are also determined, as the trade standard of olive oil is established based on particular fatty acids.

| Chemical data
This study is based on data obtained from 222 Andalusian EVOOs collected from 2005 to 2010. Olive oil was either extracted by the producers through a two-phase centrifugation system or extracted by the staff of the Agronomy Department of the University of Cordoba with an Abencor system (which reproduces the industrial process on the laboratory scale and follows the same stages of grinding, beating, centrifugation, and decantation).
Samples were kept in the fridge so that their properties were not modified (Baeten, Aparicio, Marigheto, & Wilson, 2003). μL of injection volume, and 260 °C of detector temperature. The oven temperature was programmed to remain at 180 °C for 15 min, then raised to 240 °C at a rate of 4 °C/min, and maintained at this temperature for 5 min. The triacylglycerol samples (olive oil samples) were submitted to a cold transesterification procedure to convert the triacylglycerols into fatty acid methyl esters. This method is indicated for edible oils with an acidity index lower than 3.3°: firstly, 0.1 g of olive oil was transferred into a 5-mL volumetric flask; secondly, 2 mL of n-heptane and 0.2 mL of a 2N KOH solution in methanol were added, and the reaction mixture was vigorously stirred; finally, the methyl esters were extracted and subjected to GC analysis.

The EU Commission Delegated Regulation (2016) and the International Olive Council (2012) establish the characteristics of olive oils used as purity criteria in order to authenticate them and avoid adulteration with lower quality oils. In particular, the limit values for fatty acids are regularly updated taking into account the indications of chemical experts and are shown in Table 1.

| Agro-climatic data
The Spanish official webpage of the Andalusian Institute of Agricultural, Fisheries, Agrifood, and Organic Production Research and Training (at https://www.juntadeandalucia.es/agriculturaypesca/ifapa/ria/servlet/FrontController) provides the long-run information registered in the Automatic Weather Stations (AWSs). Therefore, this website has been used to obtain the agro-climatic data of the study: historical data can be downloaded once the name of the station, the agro-climatic measurements, and the start and end dates have been selected. There are approximately 120 AWSs across the Andalusian provinces, with a suitable maintenance plan and an exhaustive review of the records supplied by the sensors. This work only considers the daily information obtained from 2005 to 2010 (the years preceding the corresponding oil harvests) for the 28 AWSs specified in Table 2, selected because of their proximity to the geographical locations where the oils were extracted.
Information about the following variables has been downloaded from each AWS: Temp, daily average temperature (°C); Hum, daily average relative humidity (%); WSpe, daily average wind speed (m/s); Rad, daily average radiation (W/m²); Precip, daily precipitation (L/m²); and ETo, evapotranspiration (mm/day), that is, the loss of moisture from a surface through either direct evaporation or the transpiration of the vegetation. Technical information about the measuring instruments can also be obtained from the above-mentioned link. Figure 2 depicts the agro-climatic series for the observed period (2,191 days, in total, as there is a leap year) and the 28 AWSs.
Taking into account the discrepancies among the curves corresponding to the different AWSs, a computer program has been designed by using the R-project (R Core Team, 2018) that associates each EVOO with the agro-climatic curve corresponding to the year preceding the olive harvest and to the nearest AWS (or the average of the nearest AWSs), for each of the six agro-climatic variables (Temp, Hum, WSpe, Rad, Precip, and ETo). In particular, the programmed R function has the following arguments: station, harvest year, month1-month2, and agro-climatic variable, and it returns the aggregated agro-climatic measurement according to this selection. Detailed information on the R code is included in the Supplementary Material. Furthermore, the agro-climatic measurements have been accumulated in order to relate them more adequately to the phenological cycle of the olive grove, which could directly influence the composition of the oil. As shown in Figure 3, this cycle is not equally distributed over the year, and therefore, the months of each period could be studied independently. Along the same lines, Orlandi, Bonofiglio, Romano, and Fornaciari (2012) study the influence of climate data on oil production in southern Italy by considering meteorological variables on a monthly basis.
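The aggregation function described above can be illustrated with a minimal sketch. It is written in Python for illustration only (the authors' implementation is in R and is given in their Supplementary Material), and the function name, the data layout (a dictionary of daily series), the station names, and the reading of "preceding year" as the calendar year before the harvest are all assumptions of this sketch.

```python
from datetime import date, timedelta


def aggregate_agroclimatic(records, stations, harvest_year, month1, month2, variable):
    """Accumulate a daily variable over months month1..month2 (inclusive).

    records: dict mapping (station, variable) -> dict mapping date -> daily value
             (hypothetical layout, not the authors' data structure).
    stations: list of station names; the per-station totals are averaged,
              mimicking "the average of the nearest AWSs".
    Assumption: the relevant year is the calendar year before the harvest.
    """
    year = harvest_year - 1
    totals = []
    for station in stations:
        series = records[(station, variable)]
        values = [value for day, value in series.items()
                  if day.year == year and month1 <= day.month <= month2]
        if not values:
            raise ValueError(f"no daily records for {station} in {year}")
        totals.append(sum(values))
    return sum(totals) / len(totals)


def constant_series(year, value):
    """Toy daily series holding the same value on every day of a year."""
    start = date(year, 1, 1)
    n_days = (date(year + 1, 1, 1) - start).days
    return {start + timedelta(days=i): value for i in range(n_days)}


# Toy data: two hypothetical stations with constant daily precipitation in 2008.
records = {
    ("Station_A", "Precip"): constant_series(2008, 2.0),
    ("Station_B", "Precip"): constant_series(2008, 4.0),
}

# March-April 2008 (61 days) for an oil harvested in 2009.
single = aggregate_agroclimatic(records, ["Station_A"], 2009, 3, 4, "Precip")
averaged = aggregate_agroclimatic(
    records, ["Station_A", "Station_B"], 2009, 3, 4, "Precip")
```

Accumulating by sum suits precipitation or evapotranspiration; for variables such as temperature, a mean over the selected months might be preferred, which would be a one-line change in this sketch.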

| Statistical methodology
Both NIR and agro-climatic data yield large databases. On the one hand, the NIR spectrum associated with each EVOO is the result of measuring the absorbance at more than a thousand wavelengths (1,237).
On the other hand, six agro-climatic variables (temperature, humidity, wind speed, radiation, precipitation, and evapotranspiration) can be assigned to each EVOO. Each agro-climatic series is formed by 2,191 observations (corresponding to the daily measurements during six consecutive years, one of which is a leap year).
NIR and agro-climatic data can be viewed either from a scalar point of view (i.e., as an extensive discretization of points) or from a functional point of view (i.e., as a curve observed over an interval). Problems tackled by statistical techniques with functional data are, basically, the same as those of classical statistics. Here, the problem is the fit of regression models to predict the fatty acid content of EVOO as a function of (scalar or functional) NIR and agro-climatic data, considering the values obtained by GC as a reference. These models initially contain only NIR information, and then, one of the six agro-climatic variables is also included among the regressors. The objective is not only to reduce the dimensionality of high-dimensional data sets to a small number of components, displaying them in a space of lower dimension, but also, fundamentally, to use these PCA and FPCA components to predict the fatty acid profile of EVOO in regression models.
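To make the functional view more concrete, the following Python sketch computes functional principal component scores for a set of discretized curves. It is an illustration under stated assumptions and not the authors' procedure (which relies on R): the Fourier basis, the number of basis functions, the trapezoidal quadrature, and all function names are choices made here for the example. Each curve is first projected onto the basis by least squares; the PCA is then carried out on the basis coefficients using the metric given by the Gram matrix of the basis, which is the usual basis-expansion route to FPCA.

```python
import numpy as np


def fourier_basis(t, n_basis):
    """Evaluate a constant plus sine/cosine pairs on the grid t."""
    u = 2.0 * np.pi * (t - t[0]) / (t[-1] - t[0])
    columns = [np.ones_like(t)]
    k = 1
    while len(columns) < n_basis:
        columns.append(np.sin(k * u))
        if len(columns) < n_basis:
            columns.append(np.cos(k * u))
        k += 1
    return np.column_stack(columns)


def trapezoid_weights(t):
    """Weights w such that sum(w * f(t)) approximates the integral of f."""
    w = np.zeros_like(t)
    dt = np.diff(t)
    w[:-1] += dt / 2.0
    w[1:] += dt / 2.0
    return w


def fpca(curves, t, n_basis=11, n_components=3):
    """FPCA of curves (n_curves x n_points) through a basis expansion."""
    t = np.asarray(t, dtype=float)
    B = fourier_basis(t, n_basis)                      # n_points x n_basis
    w = trapezoid_weights(t)
    # Least-squares basis coefficients of every curve (n_curves x n_basis).
    coefs = np.linalg.lstsq(B, curves.T, rcond=None)[0].T
    centered = coefs - coefs.mean(axis=0)
    # Gram matrix of the basis under the L2 inner product, W = L @ L.T.
    W = B.T @ (w[:, None] * B)
    L = np.linalg.cholesky(W)
    _, s, Vt = np.linalg.svd(centered @ L, full_matrices=False)
    V = Vt[:n_components].T
    scores = centered @ L @ V                          # functional PC scores
    eigenfunctions = B @ np.linalg.solve(L.T, V)       # unit L2 norm
    explained = s[:n_components] ** 2 / np.sum(s ** 2)
    return scores, eigenfunctions, explained


# Toy curves: two smooth modes of variation plus a little noise.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 200)
a = rng.normal(0.0, 2.0, size=30)
b = rng.normal(0.0, 0.5, size=30)
curves = (a[:, None] * np.sin(2 * np.pi * t)
          + b[:, None] * np.cos(4 * np.pi * t)
          + rng.normal(0.0, 0.01, size=(30, t.size)))

scores, eigenfunctions, explained = fpca(curves, t)
w = trapezoid_weights(t)
gram = eigenfunctions.T @ (w[:, None] * eigenfunctions)
```

The scalar counterpart would apply an ordinary PCA directly to the matrix of discretized values; the resulting scores (scalar or functional) can then be used as regressors in a linear model.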
Furthermore, the literature includes many works analyzing the optimal number of components to be retained in principal component analysis (Saccenti & Camacho, 2015; Vitale et al., 2017). Some of them consider the possibility of progressively including components in the model until a new one does not significantly increase the explained variability of the data. In particular, the classical and ad hoc Kaiser's rule (the default in many statistical software packages) suggests retaining those factors that explain a percentage of variability equal to or higher than 10 (technically, with eigenvalues equal to or higher than 1). Some authors do not recommend using this cutoff criterion, as it constitutes a case-specific strategy (not easily generalizable to data of various natures) and tends to over-extract components. The overfitting of statistical models is not recommended, as it could introduce noise in the regression coefficients and cause problems in the verification of the hypotheses of linear models (Hawkins, 2004). Other more recently
Regarding the software, the R-project (R Core Team, 2018) has been used to connect the databases of NIR and agro-climatic curves; then, the packages of "pls" (Wehrens & Mevik, 2007), "fda" (Ramsay,
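The two kinds of criteria discussed in this section, Kaiser's rule and a computational criterion based on cross-validation, can be contrasted with a short Python sketch. It is only an illustration on simulated data: the function names, the 5-fold scheme, and the use of the correlation matrix for Kaiser's rule are assumptions of this example, and the authors' own computations were carried out in R.

```python
import numpy as np


def kaiser_rule(X):
    """Number of components of the correlation matrix with eigenvalue >= 1."""
    eigenvalues = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))
    return int(np.sum(eigenvalues >= 1.0))


def pcr_predict(X_train, y_train, X_test, k):
    """Principal component regression with k components fit on the training set."""
    x_mean = X_train.mean(axis=0)
    y_mean = y_train.mean()
    Xc = X_train - x_mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:k].T                                   # loadings of the first k PCs
    beta = np.linalg.lstsq(Xc @ V, y_train - y_mean, rcond=None)[0]
    return y_mean + (X_test - x_mean) @ V @ beta


def cross_validated_components(X, y, max_components, n_folds=5, seed=0):
    """Return the number of components with the lowest K-fold CV RMSE."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    rmse = np.zeros(max_components)
    for k in range(1, max_components + 1):
        squared_error = 0.0
        for test_idx in folds:
            train = np.ones(len(y), dtype=bool)
            train[test_idx] = False
            pred = pcr_predict(X[train], y[train], X[test_idx], k)
            squared_error += np.sum((y[test_idx] - pred) ** 2)
        rmse[k - 1] = np.sqrt(squared_error / len(y))
    return int(np.argmin(rmse)) + 1, rmse


# Simulated data: two latent factors; the response depends on the weaker one.
rng = np.random.default_rng(1)
n = 200
f1 = rng.normal(0.0, 2.0, n)
f2 = rng.normal(0.0, 1.0, n)
X = np.column_stack([f1, f1, f2, f2]) + rng.normal(0.0, 0.05, (n, 4))
y = 3.0 * f2 + rng.normal(0.0, 0.1, n)

n_kaiser = kaiser_rule(X)
n_cv, cv_rmse = cross_validated_components(X, y, max_components=4)
```

Kaiser's rule looks only at the regressors, whereas cross-validation also uses the response, so the two criteria need not agree.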

| RESULTS AND DISCUSSION
In this section, regression models are fit to predict the fatty acid profile of EVOO (obtained by GC as a reference), firstly, as a function of the NIR information and, secondly, when the agro-climatic daily data are added to the model. NIR spectra and agro-climatic curves are treated from scalar and functional points of view, being summarized in both cases by principal component analysis (PCA and FPCA, respectively). The goodness of fit of the statistical models is compared by using DRMSE and taking into account the number of components retained in the model.
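The comparison described above can be sketched in Python as follows. This is a hedged illustration on simulated data, not the authors' code: the normalization used for DRMSE (RMSE divided by the mean of the reference values) is an assumption, since the exact definition is not given in this section, and using the same number of components for the NIR and agro-climatic blocks is likewise a simplification made for the example.

```python
import numpy as np


def pca_scores(X, n_components):
    """Scores of the first n_components principal components of X."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * s[:n_components]


def drmse(y_true, y_pred):
    """RMSE divided by the mean reference value (assumed normalization)."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2)) / np.mean(y_true)


def drmse_curve(blocks, y, max_components=10):
    """DRMSE of a linear model on the first k PCs of each block, k = 1..max."""
    values = []
    for k in range(1, max_components + 1):
        scores = np.column_stack([pca_scores(X, k) for X in blocks])
        design = np.column_stack([np.ones(len(y)), scores])
        beta = np.linalg.lstsq(design, y, rcond=None)[0]
        values.append(drmse(y, design @ beta))
    return np.array(values)


# Simulated data: a fatty acid percentage driven by one NIR factor and one
# agro-climatic factor (all numbers are illustrative only).
rng = np.random.default_rng(2)
n = 60
f_nir = rng.normal(size=n)
f_agro = rng.normal(size=n)
nir = np.outer(f_nir, rng.normal(size=50)) + rng.normal(0.0, 0.3, (n, 50))
agro = np.outer(f_agro, rng.normal(size=40)) + rng.normal(0.0, 0.3, (n, 40))
y = 70.0 + 2.0 * f_nir + 1.5 * f_agro + rng.normal(0.0, 0.2, n)

drmse_nir = drmse_curve([nir], y)            # NIR information only
drmse_both = drmse_curve([nir, agro], y)     # NIR plus one agro-climatic variable
```

Because these are in-sample errors of nested models, the curve with both blocks can never lie above the NIR-only curve; an out-of-sample comparison would be needed to judge whether the added components generalize.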
In order to determine the number of (scalar or functional) principal components to be retained, Table 3 includes the optimal number of components when a classical ad hoc criterion (Kaiser's rule) or a computational criterion (cross-validation) is considered. Both PCA and FPCA regressions are contemplated, with either only NIR information or NIR plus agro-climatic information. The results for the classical criterion differ from those for the computational criterion but, in general, they suggest that retaining more than around ten components could cause overfitting in the regression model. For this reason, the following figures show the evolution of DRMSE when PCA and FPCA components, from one to ten, are introduced in the regression models.
and PUFAs fatty acids of EVOO by using only NIR spectral information (black lines) or also adding a specific agro-climatic information
a In all cases, the percentage of variability of data explained by the selected components is greater than 85%.
PUFAs fatty acids, in general), the situation is the opposite for a number of components greater than seven-eight (a higher number than recommended to avoid overfitting in Table 3).
• As the dimensionless RMSE (DRMSE) is represented on the y-axis with the same range (0 to 0.4), the values for the different graphics can be compared. Although Figure 4 depicts that the most accuracy
FIGURE 5 RMSE in SFA estimations by PCA* and FPCA** regression models from NIR and agro-climatic data
FIGURE 6 RMSE in MUFA estimations by PCA* and FPCA** regression models from NIR and agro-climatic data
FIGURE 7 RMSE in MUFA estimations by PCA* and FPCA** regression models from NIR and agro-climatic data

| CONCLUSIONS
In recent years, fast, reliable, and cost-effective analytical procedures have been established in studies about the purity, authentication, and traceability of olive oils. In this sense, NIR spectra have been routinely used, in combination with chemometrics, to obtain useful qualitative and quantitative information about olive oils.
Moreover, the literature contains many works analyzing the influence of agro-climatic conditions on food components, in general, and on olive oils, in particular. But all these works contemplate this agro-climatic information as a factor, a non-numerical variable.
Furthermore, FDA currently constitutes an active field of research in data science, and it has been used with chemical data, in particular, with NIR spectra. Nevertheless, FDA has not previously been applied to olive oil data.
Therefore, this work highlights that NIR spectra are particularly useful to estimate MUFAs (in particular, oleic acid). But the reliability or goodness of fit of all fatty acid predictions (SFAs, MUFAs, PUFAs, and the disaggregated fatty acids: palmitic, stearic, palmitoleic, oleic, linoleic, and linolenic) can be improved by adding agro-climatic data (especially, temperature and evapotranspiration) to the regression models.
The high-dimensional information contained in NIR spectra and agro-climatic curves is summarized by using principal components analysis, where both scalar and functional approaches are used.
The corresponding PCA and FPCA components are progressively introduced in regression models, whose goodness of fit is measured by DRMSE (dimensionless RMSE, useful in comparisons).
The classical Kaiser's rule and the computational cross-validation criterion have been applied to determine the optimal number of components to be retained in the regression models (obtaining, in general, values around ten). The results show that the functional point of view and the use of both NIR and agro-climatic information give better estimations of the fatty acid profile for a low number of components, the ideal situation to avoid overfitting. Finally, as the International Olive Council (2012) establishes the purity criteria for olive oils by using disaggregated fatty acids (see Table 1), DRMSE is depicted for palmitic, stearic, palmitoleic, oleic, linoleic, and linolenic fatty acids under the same assumptions. Although MUFA estimations are, in general, the best, the disaggregated estimations for palmitoleic and oleic acids differ in reliability, the latter being considerably better in goodness of fit.

ACKNOWLEDGMENT
The authors thank all those who have helped in carrying out the research.

CONFLICT OF INTEREST
The authors declare no conflict of interest.

ETHICAL STATEMENT
This study does not involve any human or animal testing.