Discriminating geographical origins of green tea based on amino acid, polyphenol, and caffeine content through high‐performance liquid chromatography: Taking Lu’an guapian tea as an example

Abstract Seventy‐three Lu'an guapian tea (LAGP) samples were collected from 11 growing locations in the city of Lu'an, Anhui Province, China. Through high‐performance liquid chromatography, 18 amino acids, along with gallic acid, caffeine, and five catechins, were quantified. Hierarchical cluster analysis, correlation analysis, principal component analysis, and a support vector machine were used for geographical discrimination. The findings suggested that the differences in tea quality between the inner and outer mountain regions are related to isoleucine, leucine, phenylalanine, and valine contents, with a correlation coefficient of more than 0.85. Principal component analysis combined with a support vector machine was a feasible method. The identification rates for the inner and outer mountains were 97.96% in the training set and 95.83% in the prediction set. Furthermore, the identification rates for the three counties were 91.84% and 95.83% in the training and prediction sets, respectively.

At present, most techniques for tea identification (quality and variety) rely on sensory evaluation of color, aroma, taste, and appearance by trained tea specialists. The obvious disadvantage of sensory evaluation is that it is easily influenced by the taster's experience or environment. Teas from the same geographical origin that are processed by the same method probably have a similar or typical composition. Such a chemical composition could give these teas their distinctive characteristics and also enable LAGP to be discriminated according to geographical origin.

| Samples and chemicals
In total, 73 LAGP samples were collected from 11 village sampling points spanning two counties and one district and covering the inner and outer mountain regions (Table S1). The main difference between inner mountain tea and outer mountain tea is the altitude of the plantation. The samples were kept in aluminum foil bags, stored at 4°C, and ground into powder that was passed through a 600-μm sieve before analysis.
High-performance liquid chromatography-grade acetonitrile was obtained from Tedia (Tedia Co.); GA, EGC, EGCG, C, EC, ECG (>99.99%), caffeine, and Thea were purchased from Sigma. A reagent kit was obtained from Waters that included a mixture of 17 amino acid hydrolysate standards (Waters AccQ·Tag Amino Acid Standard H), the derivatization reagent AQC (Waters AccQ·Fluor Reagent), and eluent A (Waters AccQ·Tag Eluent A, concentrate). Milli-Q-treated water with a resistivity higher than 18 MΩ·cm was prepared using a water purification system (Aquapro International Co.).

| Amino acids
According to ISO/WD 19563 (2017), 1.00 ± 0.01 g of sample powder was placed in a 200-ml beaker, and 100 ml of boiling water was added.
The sample was brewed on a magnetic stirrer (500 rpm) for 5 min, filtered, and made up to a fixed volume. Approximately 1 ml of the sample solution was centrifuged at 16,000 × g for 10 min.

| Polyphenols and caffeine
According to ISO 14502-2:2005 (ISO 14502-2, 2006), 0.20 ± 0.01 g of powder was extracted with 5 ml of methanol-water (70%) for 10 min at 70°C under magnetic stirring. The mixture was then centrifuged at 4,200 × g for 10 min, and the supernatant was transferred into a 10-ml volumetric flask. The extraction was repeated, and the two extracts were combined and made up to 10 ml.
The extract was filtered through a 0.22-μm Millipore membrane before injection.

| Statistical analysis
In this study, hierarchical cluster analysis, correlation analysis, and principal component analysis were used for factor selection. Hierarchical cluster analysis (HCA), an unsupervised learning process, divides similar objects into groups or subsets through static classification so that members of the same subset have similar properties; it uses the Euclidean distance (Saito & Toriwaki, 1994) to calculate the distance (similarity) between data points of different categories. A more general alternative is the weighted Euclidean distance between two vectors x_i and x_j:

$$d(x_i, x_j) = \sqrt{\sum_{k=1}^{K} w_k \,(x_{ik} - x_{jk})^2} \tag{1}$$

For w_k = 1 for each k = 1, …, K, Equation (1) reduces to the ordinary Euclidean distance.
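The distance used by HCA can be illustrated with a minimal Python sketch. It assumes the usual form of the weighted Euclidean distance, the square root of the weighted sum of squared coordinate differences; the function name `weighted_euclidean` and the example vectors are hypothetical and are not taken from the study's data or its Matlab code.

```python
import math


def weighted_euclidean(x_i, x_j, weights=None):
    """Weighted Euclidean distance between two equal-length feature vectors.

    With weights=None every weight is 1, which gives the ordinary
    Euclidean distance.
    """
    if len(x_i) != len(x_j):
        raise ValueError("vectors must have the same length")
    if weights is None:
        weights = [1.0] * len(x_i)
    if len(weights) != len(x_i):
        raise ValueError("one weight per variable is required")
    return math.sqrt(
        sum(w * (a - b) ** 2 for w, a, b in zip(weights, x_i, x_j))
    )


# Hypothetical content vectors for two samples (illustrative values only).
sample_a = [3.0, 1.0, 2.0]
sample_b = [0.0, 5.0, 2.0]

d_plain = weighted_euclidean(sample_a, sample_b)
d_weighted = weighted_euclidean(sample_a, sample_b, [4.0, 0.0, 1.0])
print(d_plain, d_weighted)
```

Setting a weight to zero removes that variable from the comparison, which is one way a weighting scheme can emphasize selected components during clustering.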
Correlation analysis examines two or more related variables to measure how closely they are associated (Ezekiel, 1941). The Pearson correlation coefficient, the covariance of two variables divided by the product of their standard deviations, was used to measure the degree of correlation in this study. Principal component analysis (PCA) (Wold, Esbensen, & Geladi, 1987; Yu, Wang, Xiao, & Liu, 2009), one of the simplest methods for analyzing multivariate data, is often used to reduce the dimensionality of data sets while retaining as much of the variance as possible. The obtained results provide a rough classification of all samples. Support vector machine (SVM) (Vapnik, 2013) is a generalized linear classifier trained by supervised learning, and its decision boundary is the maximum-margin hyperplane fitted to the training samples. SVM can perform nonlinear classification by means of the kernel method and is one of the common kernel learning methods; it uses the hinge loss function to calculate the empirical risk and adds a regularization term to the optimization problem to control the structural risk (Chen, Zhao, Fang, & Wang, 2007).
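The definition of the Pearson correlation coefficient given above can be written out as a short Python sketch. The function name `pearson_r` and the example values are hypothetical; the study itself computed correlations in SPSS, so this only illustrates the formula.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the
    standard deviations (the 1/n factors cancel, so sums are used)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length series with at least 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(ss_x * ss_y)


# Hypothetical values: a component content and a numeric region code.
content = [1.0, 2.0, 3.0, 4.0]
region_code = [2.0, 4.0, 6.0, 8.0]

r_positive = pearson_r(content, region_code)
r_negative = pearson_r(content, list(reversed(region_code)))
print(r_positive, r_negative)
```

A coefficient close to 1 or -1 indicates a strong linear association, which is the sense in which a value above 0.85 is read in this study.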
Moreover, SVM can learn in a high-dimensional feature space with few training data. The aforementioned statistical analyses were performed using Matlab 2016a (MathWorks, Inc.). Correlation analysis and one-way ANOVA were performed using SPSS (version 22.0 for Windows).
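The hinge loss and regularization term mentioned in the description of SVM can be sketched as follows. This assumes one common soft-margin formulation for a linear classifier, the mean hinge loss plus half the regularization strength times the squared norm of the weights; the function name `svm_objective`, the parameter `reg_strength`, and the toy data are hypothetical, and the kernel and settings actually used in Matlab are not stated in the text.

```python
def svm_objective(weights, bias, samples, labels, reg_strength):
    """Regularized hinge-loss objective for a linear SVM.

    empirical risk  = mean of max(0, 1 - y * (w.x + b))
    structural term = 0.5 * reg_strength * ||w||^2
    Labels must be -1 or +1.
    """
    if not samples or len(samples) != len(labels):
        raise ValueError("samples and labels must be non-empty and aligned")
    hinge_total = 0.0
    for x, y in zip(samples, labels):
        if y not in (-1, 1):
            raise ValueError("labels must be -1 or +1")
        score = sum(w * v for w, v in zip(weights, x)) + bias
        hinge_total += max(0.0, 1.0 - y * score)
    empirical_risk = hinge_total / len(samples)
    regularizer = 0.5 * reg_strength * sum(w * w for w in weights)
    return empirical_risk + regularizer


# Toy two-dimensional data (illustrative only, not tea measurements).
toy_samples = [[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0], [0.5, 1.0]]
toy_labels = [1, 1, -1, -1]

objective = svm_objective([1.0, 0.0], 0.0, toy_samples, toy_labels, 0.1)
print(objective)
```

Samples that lie on the correct side of the margin contribute nothing to the hinge term, so only margin violations and the weight norm determine the value being minimized.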

| Amino acid, GA, catechin, and caffeine detection and analysis
In this study, the contents of all tested substances were quantified using HPLC and are provided in Table S2. This study examined two classes of tea: inner mountain LAGP (IMGP) and outer mountain LAGP (OMGP, including YFD, 29K, and SBC). Table 1 shows the differences in the three major metabolite contents for each area. Notably, the concentrations of GA, Met, Leu, Ile, and Val differed significantly (p < 0.05) among HSGP, JZGP, and YAGP, and the contents of EGCG, Ser, Lys, Phe, Val, and other components differed by approximately 16%-60% between IMGP and OMGP. A heat map (Figure 3) was created using HemI (version 1.0) to visualize the content composition in different regions: Red, cyan, and blue areas indicate high, low, and moderate levels of chemical composition, respectively. All LAGP samples shared the same relative profile: high EGCG, caffeine, EGC, ECG, and Thea contents but low Gly and Cys contents. HCA showed that Huoshan County guapian (HSGP; EC and DPD) and OMGP differed greatly from the other regions.

| Correlation analysis
Correlation analysis between chemical composition and counties/districts was performed using the Pearson correlation coefficient, and the coefficients are shown in Table 2 (Mao et al., 2018). GA content is significantly positively correlated with the quality grade of green tea, and GA is a characteristic component of green tea (Graham, 1992). These findings match the results of sensory evaluation: Inner mountain tea has a less bitter, mellower taste than outer mountain tea. However, the geographical tracing was negatively correlated with the geographical distance. Some sampling points, such as SJ and FH as well as QY and FH, border each other and are sometimes even on two sides of the same mountain. For example, the area where Jinzhai County and Huoshan County adjoin is the famous Dabie Mountains.

FIGURE 3 Heat map of 25 variables and the results of hierarchical clustering for origins

| Principal component analysis
Principal component analysis, a commonly used multivariate analysis method for preliminary classification (Chen et al., 2015), was applied to extract the factors associated with production regions. Two principal components, explaining up to 96.87% of the total variance, were extracted. Score plots are presented in

| Factor selection according to HCA, correlation analysis, and PCA
The final purpose of this study was to analyze the differences among LAGP from different origins and to discriminate them using the main chemical components combined with pattern recognition. The results of factor selection

| Geographical tracing of LAGP through SVM combined with different factors
The SVM models built on the original variables and on the factors selected by HCA, correlation analysis, and PCA all achieved high discrimination rates, except for the model based on correlation analysis (Table 4). The results revealed that the model obtained a high identification rate when combined with PCA. The correct classification rates for the inner and outer mountains were 97.96% in the training set and 95.83% in the prediction set, with only two samples discriminated incorrectly. However, discrimination between counties was worse than that between IMGP and OMGP: The correct discrimination rates were 91.84% in the training set and 95.83% in the prediction set. These results indicate that distinguishing LAGP by using SVM is feasible.
The geographical origins of tea are difficult to trace because many factors jointly determine origin. The chemical composition of tea can easily be changed by climate (Wang et al., 2011), agricultural practices in tea plantations (Wang et al., 2016), and tea processing (Huang, Sheng, Yang, & Hu, 2007). Of these, geographical location is one of the most critical factors. The discrimination rate between the inner and outer mountain samples in this study was high, possibly because of the relationship with altitude exhibited by the samples (Table S1). In addition, slight differences in processing parameters among production areas might also affect geographical tracing. Huoshan County and Jinzhai County are traditional LAGP processing areas that emphasize "heat" during processing, which makes JZGP and HSGP highly dissimilar to YAGP and might also explain why YAGP was easier to identify.
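The reported identification rates can be checked with simple arithmetic. The sketch below assumes that the 73 samples were split into 49 training and 24 prediction samples; this split and the per-set counts of correct classifications are inferred from the reported percentages and are not stated in the text. The function name `identification_rate` is hypothetical.

```python
def identification_rate(n_correct, n_total):
    """Percentage of correctly classified samples, rounded to 2 decimals."""
    if n_total <= 0 or not 0 <= n_correct <= n_total:
        raise ValueError("counts must satisfy 0 <= n_correct <= n_total")
    return round(100.0 * n_correct / n_total, 2)


# ASSUMPTION: 49 training + 24 prediction samples (49 + 24 = 73).
# These counts are inferred from the reported rates, not given in the text.
mountain_train = identification_rate(48, 49)    # one training sample wrong
mountain_predict = identification_rate(23, 24)  # one prediction sample wrong
county_train = identification_rate(45, 49)      # four training samples wrong
county_predict = identification_rate(23, 24)    # one prediction sample wrong

print(mountain_train, mountain_predict, county_train, county_predict)
```

Under this assumed split, one error in each set for the inner and outer mountain model is consistent with the two misclassified samples mentioned above.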

ACKNOWLEDGMENTS
This work was financially supported by the National Key Research and Development Plan (No. 2017YFD040800) and Modern Agriculture (tea) Special System of China.

CONFLICT OF INTEREST
Huan Su, Weiquan Wu, Xiaochun Wan, and Jingming Ning declare that they do not have any conflict of interest.

ETHICAL REVIEW
This study does not involve any human or animal testing.

INFORMED CONSENT
Written informed consent was obtained from all study participants.