Establishment and application of risk classification model for lead in vegetables based on spectral clustering algorithms

Abstract This study aims to evaluate the risk of lead pollution in 9 kinds of vegetables consumed by residents in 20 provinces/cities of China. Sampling data and vegetable consumption data from 20 provinces/cities in 2019 were used. Combined with dietary exposure assessment, the vegetable categories and provinces were paired, and a risk classification model based on spectral clustering algorithms was proposed. The results of the spectral clustering algorithm showed that the risk of lead pollution in vegetables can be divided into five levels. Vegetable‐province/city combinations at risk levels 1 and 2 accounted for 92.78% of the total, and those at risk levels 4 and 5 accounted for 2.22%. The high‐risk combinations were fresh edible fungus–Shaanxi, fresh edible fungus–Sichuan, fresh edible fungus–Shanghai, and bean sprouts–Guangdong. In the proposed model, objective data were used as the classification index, and the spectral clustering algorithm was employed to select the optimal risk classification in a data‐driven way. As a result, the influence of subjective factors was effectively reduced, the risk of lead pollution in vegetables was classified, and the results were scientific and accurate. This study provides a scientific basis for regulatory departments to set supervision priorities.


| INTRODUCTION
Lead is a nonessential metal element for plants (Naiming, 2013).
When lead is absorbed by plants and accumulates to a certain extent, crop quality can be affected (Jinda et al., 2005). Once lead enters the body through consumption, it is harmful to human health (Honghong et al., 2020). Low concentrations of lead damage the human nervous system and kidneys, and high concentrations of lead can cause cancer or death. A large number of studies have shown that lead is one of the important pollutants affecting food safety and human health (Kaiser, 1998; Markus & McBratney, 2001).
Vegetables are an important part of the human diet and one of the main sources of minerals and vitamins required by humans (Grusak & DellaPenna, 1999;Welch & Graham, 1999). With the improvement of living standards, the demand for pollution-free vegetables in China is increasing day by day.
In recent years, lead pollution in crops and its health risk to the human body have been highlighted in research (Rongguang  Agency (US EPA). The reference dose (RfD) of lead is 3.5 μg/(kg·d), and the lower limit of the 95% confidence interval for the 1% benchmark dose (BMDL01) is 0.6 μg/kg bw per day.

| Model building
According to the risk assessment method and the purpose of the model, the main factors influencing the health risks caused by food pollutants were considered, and the Nemerow integrated pollution index (NIPI), hazard index (HI), and margin of exposure (MOE) were selected as the model indexes. The mean, median (P50), 95th percentile (P95), and maximum lead content in vegetables were selected as the characteristics of food pollution to calculate lead exposure under different pollution levels.
According to the sampling data in different provinces/cities, the degree of lead pollution is calculated using the NIPI as follows:

$$P_{i,j} = \frac{X_{i,j}}{S_j}$$

where $P_{i,j}$ is the pollution index of vegetable j in province/city i; $X_{i,j}$ is the detected lead content in vegetable j in province/city i (mg/kg); and $S_j$ is the national limit standard of lead in vegetable j (mg/kg).

$$P_{c(i,j)} = \sqrt{\frac{P_{\max(i,j)}^{2} + P_{\mathrm{ave}(i,j)}^{2}}{2}}$$

where $P_{c(i,j)}$ is the NIPI of vegetable j in province/city i; $P_{\max(i,j)}$ is the maximum pollution index of vegetable j in province/city i; and $P_{\mathrm{ave}(i,j)}$ is the average value of the pollution index $P_{i,j}$ of vegetable j in province/city i.
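As an illustration, the NIPI calculation can be sketched in Python using the standard Nemerow form (root mean square of the maximum and average pollution indices). The function names and the sample values below are hypothetical and are not taken from the study data.

```python
from math import sqrt


def pollution_indices(detections_mg_per_kg, limit_mg_per_kg):
    """Single-sample pollution index P = X / S for each detection value."""
    return [x / limit_mg_per_kg for x in detections_mg_per_kg]


def nemerow_index(detections_mg_per_kg, limit_mg_per_kg):
    """Nemerow integrated pollution index from the maximum and mean of P."""
    p = pollution_indices(detections_mg_per_kg, limit_mg_per_kg)
    p_max = max(p)
    p_ave = sum(p) / len(p)
    return sqrt((p_max ** 2 + p_ave ** 2) / 2)


# Hypothetical lead detections (mg/kg) for one vegetable-province pair,
# evaluated against an assumed limit of 0.3 mg/kg.
samples = [0.03, 0.06, 0.15]
nipi = nemerow_index(samples, limit_mg_per_kg=0.3)
print(round(nipi, 4))
```

Because the maximum enters the formula directly, a single high detection raises the NIPI more than it would raise a plain average.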
The HI is used to characterize the noncarcinogenic risk of lead in vegetables by comparing lead exposure with the reference dose. The expression is as follows:

$$HI_{i,j} = \frac{EDI^{95}_{i,j}}{RfD}, \qquad EDI^{95}_{i,j} = \frac{FC_{i,j} \times X^{95}_{i,j}}{W}$$

where $HI_{i,j}$ is the noncarcinogenic risk of lead from vegetable j in province/city i; RfD is the oral reference dose of lead (μg/(kg·d)); $EDI^{95}_{i,j}$ is the estimated daily intake of lead through vegetable j in province/city i under high exposure (P95), expressed in the same units as RfD; $FC_{i,j}$ is the average daily consumption of vegetable j in province/city i (kg/d); $X^{95}_{i,j}$ is the 95th percentile of the sampled lead content in vegetable j in province/city i (mg/kg); and W is the average body mass of residents (60 kg).
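A minimal sketch of the HI calculation follows. The factor of 1000 converts mg to μg so that the intake is in the same units as the RfD; the function names and the consumption and content values are hypothetical, chosen only to show the arithmetic.

```python
RFD_UG_PER_KG_DAY = 3.5   # reference dose of lead quoted in the text
BODY_MASS_KG = 60.0       # average body mass used in the text


def estimated_daily_intake(consumption_kg_per_day, lead_mg_per_kg,
                           body_mass_kg=BODY_MASS_KG):
    """Estimated daily intake in ug per kg body mass per day."""
    # kg/d * mg/kg gives mg/d; multiply by 1000 to convert mg to ug.
    return consumption_kg_per_day * lead_mg_per_kg * 1000.0 / body_mass_kg


def hazard_index(edi_ug_per_kg_day, rfd=RFD_UG_PER_KG_DAY):
    """Hazard index: exposure divided by the reference dose."""
    return edi_ug_per_kg_day / rfd


# Hypothetical inputs: 0.1 kg/d consumption, P95 lead content of 0.12 mg/kg.
edi95 = estimated_daily_intake(0.1, 0.12)
hi = hazard_index(edi95)
print(round(edi95, 3), round(hi, 4))
```

An HI below 1 means the estimated high-end exposure is below the reference dose.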
The MOE is used to characterize the risk of chronic dietary intake of lead by comparing the lower limit of the 95% confidence interval of the 1% benchmark dose of lead (BMDL01) with the exposure. The expression is as follows:

$$MOE_{i,j} = \frac{BMDL_{01}}{EDI^{50}_{i,j}}, \qquad EDI^{50}_{i,j} = \frac{FC_{i,j} \times X^{50}_{i,j}}{W}$$

where $MOE_{i,j}$ is the risk of chronic dietary intake of lead from vegetable j in province/city i; $BMDL_{01}$ is 0.6 μg/kg bw per day; $EDI^{50}_{i,j}$ is the estimated daily intake of lead from vegetable j in province/city i under moderate exposure (P50), expressed in the same units as $BMDL_{01}$; $FC_{i,j}$ is the daily consumption of vegetable j in province/city i (kg/d); $X^{50}_{i,j}$ is the median lead content of vegetable j in province/city i (mg/kg); and W is the average body mass of residents (60 kg).
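The MOE can be sketched the same way. The consumption and median content below are hypothetical; the final check uses the exposure of 0.317 μg/kg bw reported later in the paper, which gives a ratio close to the reported MOE of 1.8931 (the small difference is presumably rounding of the exposure value).

```python
BMDL01_UG_PER_KG_DAY = 0.6  # benchmark dose lower limit quoted in the text


def estimated_daily_intake(consumption_kg_per_day, lead_mg_per_kg,
                           body_mass_kg=60.0):
    """Estimated daily intake in ug per kg body mass per day (mg -> ug)."""
    return consumption_kg_per_day * lead_mg_per_kg * 1000.0 / body_mass_kg


def margin_of_exposure(edi_ug_per_kg_day, bmdl=BMDL01_UG_PER_KG_DAY):
    """Margin of exposure: benchmark dose lower limit divided by exposure."""
    return bmdl / edi_ug_per_kg_day


# Hypothetical inputs: 0.1 kg/d consumption, median lead content 0.03 mg/kg.
edi50 = estimated_daily_intake(0.1, 0.03)
moe = margin_of_exposure(edi50)

# Consistency check against the exposure value reported in the results.
moe_reported_case = margin_of_exposure(0.317)
print(round(moe, 2), round(moe_reported_case, 2))
```

A larger MOE indicates a wider gap between the exposure and the benchmark dose; values above 1 mean the exposure is below the BMDL01.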

Computational environment
In this study, the spectral clustering algorithm was run on a 64-bit Windows 10 system with an Intel(R) Xeon(R) E5-1620 v4 processor @ 3.50 GHz, 64 GB of memory, and an NVIDIA GTX 1060 Ti as the data accelerator. Python and related libraries were used for programming.

Clustering classification
The possibility of excessive pollutants, the exposure, and the harmfulness of food pollutants were considered to quantify the risk factors of pollutants, and the food safety risk assessment model was established from the above three indexes. Clustering is the process of dividing a given sample into multiple clusters so that samples in the same cluster are highly similar and samples in different clusters have low similarity. Clustering analysis can be used to mine deeper information from data.
Because it is insensitive to the shape of the sample distribution and supports high-dimensional data well, the spectral clustering algorithm can achieve good clustering performance in a sample space of arbitrary shape and is suitable for analyzing the model data of this study.
Determining the classification levels scientifically and accurately is one of the main problems of food safety risk classification. In this study, the clustering algorithm was used to determine the risk level of food pollutants. The Calinski-Harabasz (CH) index was calculated for each combination of algorithm parameter and cluster number; the larger the CH index, the smaller the total similarity between clusters and the better the clustering effect. The combination with the highest CH index was selected, so that risk classification was carried out in a data-driven way and the subjectivity of risk classification was reduced.
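For reference, the CH index is the ratio of between-cluster dispersion to within-cluster dispersion, each scaled by its degrees of freedom. The sketch below implements this standard definition with NumPy; the function name and the toy data are illustrative and are not the study's code or data.

```python
import numpy as np


def calinski_harabasz(X, labels):
    """CH index: [B / (k - 1)] / [W / (n - k)] for n samples in k clusters."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = len(X)
    clusters = np.unique(labels)
    k = len(clusters)
    overall_mean = X.mean(axis=0)
    between = 0.0  # B: size-weighted squared distance of centers to overall mean
    within = 0.0   # W: squared distance of points to their own cluster center
    for c in clusters:
        points = X[labels == c]
        center = points.mean(axis=0)
        between += len(points) * np.sum((center - overall_mean) ** 2)
        within += np.sum((points - center) ** 2)
    return (between / (k - 1)) / (within / (n - k))


# Toy one-dimensional data with two obvious groups.
X_toy = [[0.0], [1.0], [10.0], [11.0]]
ch_good = calinski_harabasz(X_toy, [0, 0, 1, 1])  # groups kept together
ch_bad = calinski_harabasz(X_toy, [0, 1, 0, 1])   # groups mixed
print(ch_good, ch_bad)
```

The labeling that keeps the two groups intact scores far higher, which is the behavior the selection procedure relies on.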
The main process of the spectral clustering algorithm was as follows:
1. A matrix W describing the characteristics of the sample was constructed from the data sample.
2. The eigenvalues and eigenvectors of matrix W were calculated and sorted.
3. The eigenvectors corresponding to the first k eigenvalues after sorting were taken, and the vectors were arranged according to the column direction to form a new solution space.
4. The K-means clustering was adopted in the new solution space, and finally, the clustering results were mapped to the original solution space.
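The four steps above can be sketched with NumPy as follows. The paper does not state how W is built or which matrix is decomposed, so this sketch assumes a Gaussian (RBF) similarity matrix and the symmetric normalized Laplacian derived from it, which is one common variant. The function names, the `gamma` parameter, and the demo data are all hypothetical.

```python
import numpy as np


def rbf_affinity(X, gamma):
    """Step 1: similarity matrix W[a, b] = exp(-gamma * ||x_a - x_b||^2)."""
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq_dist)


def kmeans(Z, k, n_iter=100):
    """Plain K-means with deterministic farthest-first initialization."""
    centers = [Z[0]]
    for _ in range(1, k):
        dist = np.min([np.sum((Z - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(Z[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        d = np.sum((Z[:, None, :] - centers[None, :, :]) ** 2, axis=-1)
        labels = d.argmin(axis=1)
        new_centers = np.array([
            Z[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def spectral_clustering(X, k, gamma=1.0):
    X = np.asarray(X, dtype=float)
    W = rbf_affinity(X, gamma)                       # step 1
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    # Assumed variant: symmetric normalized Laplacian I - D^(-1/2) W D^(-1/2).
    L = np.eye(len(X)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L)             # step 2: ascending order
    U = eigvecs[:, :k]                               # step 3: first k as columns
    U = U / np.linalg.norm(U, axis=1, keepdims=True)
    return kmeans(U, k)                              # step 4: labels per sample


# Hypothetical demo: two well-separated groups of three points each.
X_demo = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels_demo = spectral_clustering(X_demo, k=2)
print(labels_demo)
```

Because each row of the embedding corresponds to one original sample, the K-means labels map back to the original samples by position.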

Data processing
According to the principle for the reliable evaluation of low-level contaminants proposed at the second meeting of the WHO Global Environmental Monitoring System/Food (GEMS/FOOD), when the proportion of undetected data is less than 60%, all undetected data are replaced by 1/2 of the limit of detection (LOD); when the proportion of undetected data is higher than 60%, all undetected data are replaced by the LOD (Xuqing et al., 2002). In this study, the undetected data were assigned a value of 1/2 LOD for statistical calculation.
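The substitution rule can be sketched as below. Non-detects are represented as `None`, and the function name and sample values are hypothetical. The rule as stated does not cover a proportion of exactly 60%; the sketch treats that case as the 1/2 LOD branch, which is an assumption.

```python
def substitute_nondetects(values, lod, threshold=0.6):
    """Replace non-detects (None) by LOD/2 or LOD depending on their share."""
    n_nondetect = sum(v is None for v in values)
    share = n_nondetect / len(values)
    # Assumption: a share of exactly 60% falls in the LOD/2 branch.
    fill = lod / 2 if share <= threshold else lod
    return [fill if v is None else v for v in values]


# Hypothetical samples (mg/kg) with an assumed LOD of 0.01 mg/kg.
mostly_detected = substitute_nondetects([0.02, None, 0.05, None], lod=0.01)
mostly_undetected = substitute_nondetects([None, None, None, 0.05], lod=0.01)
print(mostly_detected, mostly_undetected)
```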

| Lead pollution in vegetables
As shown in Table 1, the highest average lead content was found in fresh edible fungus in Shaanxi Province (0.0662 mg/kg), followed by fresh edible fungus in Shanghai City (0.0656 mg/kg) and root and potato vegetables in Sichuan Province (0.06 mg/kg).
China Food Safety Standard specifies the limit index of lead in vegetables, including 0.3 mg/kg for Brassica vegetables and

| Model index results
According to the calculation rules of model indexes, the index values of various vegetables in 20 provinces/cities were obtained, as shown in Table 3.

| Nemerow integrated pollution index
According to

| Hazard index
The noncarcinogenic risk of dietary lead intake was low in all provinces/cities: the highest HI was only 0.4705, which is below 1 and therefore corresponds to an exposure below the reference dose. The noncarcinogenic risk of dietary lead intake from eating vegetables was therefore acceptable.

| Margin of exposure
The risk of chronic dietary intake was low in all provinces/cities. The highest average daily exposure was from melon vegetables in Hunan Province (0.317 μg/kg bw), for which the MOE was 1.8931.
Therefore, the risk of chronic dietary lead intake from eating vegetables was acceptable.

| Risk classification results based on spectral clustering algorithm
The algorithm parameter was varied from 1 to 10, and the number of clustering categories was varied from 3 to 7. Table 4 shows the scores of some combinations. The combination with the highest score (293 points) had a parameter value of 5 and 5 cluster categories. Therefore, the risk of dietary lead intake from vegetables was divided into five levels.
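A search of this kind can be sketched with scikit-learn. The paper says only that Python and related libraries were used, so the choice of scikit-learn is an assumption, as is the reading of "the parameter" as the `gamma` of an RBF affinity. The data are synthetic stand-ins for the three model indexes, not the study data.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import calinski_harabasz_score

# Synthetic stand-in for (NIPI, HI, MOE) rows: three groups of 20 samples.
rng = np.random.default_rng(0)
centers = np.array([[0.2, 0.2, 0.2], [0.8, 0.2, 0.5], [0.5, 0.8, 0.8]])
X = np.vstack([c + 0.05 * rng.standard_normal((20, 3)) for c in centers])

best = None
for gamma in range(1, 11):      # assumed meaning of "parameter": RBF gamma
    for k in range(3, 8):       # number of clusters, 3 to 7
        model = SpectralClustering(n_clusters=k, affinity="rbf",
                                   gamma=float(gamma), random_state=0)
        labels = model.fit_predict(X)
        if len(set(labels)) < 2:
            continue            # CH index needs at least two clusters
        score = calinski_harabasz_score(X, labels)
        if best is None or score > best[0]:
            best = (score, gamma, k)

best_score, best_gamma, best_k = best
print(best_gamma, best_k, round(best_score, 1))
```

The number of risk levels is then the cluster count of the best-scoring combination rather than a value fixed in advance.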
The spectral clustering algorithm was used to determine the risk classification model of dietary lead intake from vegetables, and the risk classification results of the vegetable-province combinations were obtained, as shown in Table 5. The combinations with risk levels above level 3 were sorted in descending order, as shown in Table 6.
As shown in Table 5, the vegetable-province combinations at risk levels 1 and 2 accounted for 92.78% of the total, and those at risk levels 4 and 5 (high-risk levels) accounted for 2.22%. Table 6 shows that the high-risk combinations were fresh edible fungus-Shaanxi, fresh edible fungus-Sichuan, fresh edible fungus-Shanghai, and bean sprouts-Guangdong. Vegetables with relatively high risk were fresh edible fungus, bean sprouts, and root and potato vegetables.
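These shares can be checked with a short calculation, assuming that every one of the 9 vegetable categories is paired with every one of the 20 provinces/cities (180 combinations). The count of 167 low-risk combinations is inferred here from the reported percentage and is not stated in the text.

```python
n_vegetables, n_regions = 9, 20
n_combinations = n_vegetables * n_regions  # assumes all pairings are present

# The four high-risk combinations named in the text.
high_risk = [
    "fresh edible fungus-Shaanxi",
    "fresh edible fungus-Sichuan",
    "fresh edible fungus-Shanghai",
    "bean sprouts-Guangdong",
]
high_share = round(100 * len(high_risk) / n_combinations, 2)

# Inferred count behind the reported 92.78% for risk levels 1 and 2.
low_count = round(0.9278 * n_combinations)
low_share = round(100 * low_count / n_combinations, 2)
print(n_combinations, high_share, low_count, low_share)
```

Under this assumption, four combinations out of 180 reproduce the reported 2.22%, and 167 out of 180 reproduce the reported 92.78%.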

| DISCUSSION
The results of dietary exposure showed that the high and medium lead exposure of residents in all provinces/cities through vegetables were lower than the corresponding reference dose or benchmark dose; most of them were more than 5% of the reference dose, and some exceeded 20% of the reference dose. According to the rel-   study, and the clustering algorithm was employed to automatically select the optimal risk classification in a data-driven way. In actual food supervision, the order of the management priority is affected by many factors. The results of risk classification can provide a basis for regulators to set management priorities based on health risks, but food safety management cannot be carried out simply according to a model or formula (Batz et al., 2004). In this study, the risk assessment and classification methods were applied by using the sam-

CONFLICT OF INTEREST
The authors declare that they do not have any potential sources of conflict of interest.

ETHICAL APPROVAL
This study does not involve any human or animal testing. This study was Prevention and conducted in accordance with Helsinki's Declaration.

TABLE 5 Risk classification results of vegetables in provinces

INFORMED CONSENT
Written informed consent was obtained from all study participants.