Pomegranate seed clustering by machine vision

Abstract Application of new procedures for reliable and fast recognition and classification of seeds is very important in the agricultural industry. Recent advances in computer image analysis have made automated quantitative analysis practical for grouping cultivars according to minor differences in seed traits that would be indiscernible by visual inspection. In this work, nine image features and 21 physicochemical properties were extracted in order to cluster the seeds of 20 pomegranate cultivars. The aim of this study was to evaluate whether the information extracted from images of pomegranate seeds could be used instead of the time‐consuming and partly expensive experiments needed to measure their physicochemical properties. After data reduction with principal component analysis (PCA), the overlap between the clusterings obtained from these two types of data was examined. The results showed that, among the image-based clusterings, the one based on all image feature variables shared the most cultivars with the clustering based on physicochemical properties (66.67% for cluster 1, 75% for cluster 2, and 50% for cluster 3). Therefore, by applying image analysis, the seeds could largely be assigned to their pomegranate clusters without spending additional time and cost.

Currently, image analysis is a well-established complement to morphological characterization. The technique allows the enhancement of images, as well as the identification and automatic isolation of particles for further study. In addition, it is a rapid technique that allows the acquisition of quantitative data that could be very difficult or even impossible to obtain otherwise (Amaral, Rocha, Gonçalves, Ferreira, & Ferreira, 2009). Pixels are the basic components of images. Each pixel contains two kinds of information: a brightness value and a location in the coordinate system assigned to the image. The former gives the color features, while features extracted from the latter are known as size or shape features (Zheng, Sun, & Zheng, 2006).
It is therefore of major technical and economic importance to implement computer-based methods for reliable and fast identification and classification of seeds. Automatic systems can be based on seed images, from which classification features associated with seed morphological parameters and color are readily obtained. Thus, the field of machine vision, that is, image processing algorithms complemented with classification methods, seems a suitable framework for automatic seed identification (Granitto, Verdes, & Ceccatto, 2005). Besides, varietal identification of pomegranate is also of major interest in the horticultural industry.
During characterization processes, a large amount of data is usually obtained, so statistical techniques are needed to extract accurate information about the seed characteristics. Multivariate analysis has traditionally been employed for food quality evaluation. PCA is a frequently employed statistical technique and has been successfully applied for data reduction (Castell-Palou, Rosselló, Femenia, & Simal, 2010; Kallithraka et al., 2001).
This study aimed to determine the extent to which image features can be used for clustering pomegranate seeds. Therefore, clustering according to physicochemical features was performed first, and the different types of image-based clustering were then matched against it.
Fruits were transported to the laboratory in a ventilated car immediately after harvest, and defective pomegranates (with sunburn, cracks, cuts, or bruises in the peel) were discarded. The fruits were kept at 4°C until analysis.

| Physicochemical properties
Physicochemical properties and antioxidant activity were determined on 20 fruits randomly selected from each cultivar. Fruit volume was measured by the liquid displacement method. Fruit density was estimated according to Westwood (1993).
Fruit length and diameter were measured with a digital vernier caliper with 0.01 mm sensitivity. Fruit length was measured along the polar axis. The diameter was defined as the maximum width of the fruit, measured in the direction perpendicular to the polar axis. Arils were separated, and the total arils and peel per fruit were measured as above. The peel thickness was measured with a digital vernier caliper. Fruit juice content was measured by extracting the juice from the total arils per fruit using an electric extractor (model 5020, Toshiba, Tokyo, Japan).
The peel, aril, and juice percentages were calculated according to the method described by Zamani (1990).
After that, the major chemical constituents and antioxidant activity of the pomegranates were analyzed.
The pH was determined with a digital pH meter (model 601, Metrohm, Herisau, Switzerland). Titratable acidity (TA) was determined by titration to pH 8.1 with 0.1 N NaOH and expressed as g of citric acid per 100 g of juice (AOAC, 1984).
Total soluble solids (TSS) were determined with a digital refractometer (Erma, Tokyo, Japan). Total sugars were estimated according to the method described by Ranganna (2001), and ascorbic acid was determined according to Ruck (1963).
Total anthocyanins were determined with the pH differential method (Giusti & Wrolstad, 2001), and the results were expressed as mg cyanidin-3-glucoside per 100 g of juice. Total phenolics were measured colorimetrically at 760 nm using the Folin-Ciocalteu reagent (Singleton & Rossi, 1965). The results were expressed as mg gallic acid equivalent per 100 g of juice.
Antioxidant activity was assessed according to the method of Brand-Williams, Cuvelier, and Berset (1995).
Briefly, 100 μl of pomegranate juice diluted in the ratio of 1:100 with methanol:water (6:4, v/v) was mixed with 2 ml of 0.1 mmol/L 1,1-diphenyl-2-picrylhydrazyl (DPPH) in methanol. The mixtures were shaken vigorously and left to stand for 30 min. The absorbance of the resulting solution was measured at 517 nm with a UV-visible spectrophotometer (model 2010, Cecil Instr. Ltd., Cambridge, UK). The reaction mixture without DPPH was used for background correction. The antioxidant activity (AA) was determined by this relationship:
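DPPH results are commonly expressed as the percentage of radical scavenged relative to a control. The sketch below assumes that common form, AA (%) = 100 × (1 − (A_sample − A_background) / A_control). The function name, the use of a separate control absorbance (DPPH solution without juice), and the exact expression are assumptions made for illustration and may differ from the relationship used in this study.

```python
def dpph_antioxidant_activity(a_sample, a_control, a_background=0.0):
    """Percentage DPPH scavenging (assumed common form, not confirmed by the source).

    a_sample     -- absorbance at 517 nm of juice + DPPH after 30 min
    a_control    -- absorbance of the DPPH solution without juice (assumed input)
    a_background -- absorbance of the reaction mixture without DPPH
    """
    if a_control <= 0:
        raise ValueError("control absorbance must be positive")
    corrected = a_sample - a_background  # background correction
    return 100.0 * (1.0 - corrected / a_control)


if __name__ == "__main__":
    # Hypothetical absorbance readings, for illustration only.
    print(round(dpph_antioxidant_activity(0.40, 0.70, 0.05), 2))
```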

| Image acquisition
In the next stage, image processing and analysis software was developed to determine the morphological parameters and color of pomegranate seeds. For this purpose, the seeds were first pretreated.
Skin and other impurities were separated from the pomegranate seeds, and the seeds were then washed with water and air-dried.
The images were acquired using a flatbed scanner (HP ScanJet G4010, Hewlett Packard Co., CA, USA) at a resolution of 600 dpi with the following settings: highlight 190, shadows 40, and midtones 1 (scanning software HP Precisionscan Pro, Hewlett Packard Co.). For each image acquisition, about 70 pomegranate seeds were placed on the glass plate of the scanner, avoiding seed-to-seed contact. The seeds were then covered by a nonreflecting black surface. All images were taken so that the seeds approximately filled the scanner field of view, and the images were stored in JPEG format for further analysis.

| Image processing and feature extraction
For color determination, the contrast of the image background was improved and manual segmentation was performed (to extract the true images of the pomegranate seeds from the background) using Adobe Photoshop (Adobe, v.12.0). Since the L*a*b* color space is device independent and provides consistent color regardless of the input or output device (Yam & Papadakis, 2004), the preprocessed images were converted into L*a*b* units. A schematic view of the color measurement for a seed of the MDSiR cultivar is shown in Figure 1.
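The conversion to L*a*b* units can be illustrated with a short sketch. It assumes that the scanned pixels are 8-bit sRGB values and uses the D65 reference white; the study performed the conversion in Photoshop, whose internal settings may differ (for example, a D50 white point), so the numbers produced here are illustrative only. The function names `srgb_to_lab` and `mean_lab` are hypothetical.

```python
def srgb_to_lab(r, g, b):
    """Convert one 8-bit sRGB pixel to CIE L*a*b* (D65 white point assumed)."""
    def to_linear(c):
        # Undo the sRGB gamma encoding.
        c = c / 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    rl, gl, bl = to_linear(r), to_linear(g), to_linear(b)

    # Linear RGB -> CIE XYZ (sRGB primaries, D65).
    x = 0.4124564 * rl + 0.3575761 * gl + 0.1804375 * bl
    y = 0.2126729 * rl + 0.7151522 * gl + 0.0721750 * bl
    z = 0.0193339 * rl + 0.1191920 * gl + 0.9503041 * bl

    xn, yn, zn = 0.95047, 1.0, 1.08883  # D65 reference white
    delta = 6.0 / 29.0

    def f(t):
        # Cube root with the linear segment near zero.
        return t ** (1.0 / 3.0) if t > delta ** 3 else t / (3 * delta ** 2) + 4.0 / 29.0

    fx, fy, fz = f(x / xn), f(y / yn), f(z / zn)
    return 116.0 * fy - 16.0, 500.0 * (fx - fy), 200.0 * (fy - fz)


def mean_lab(pixels):
    """Average L*, a*, b* over the segmented pixels of one seed."""
    labs = [srgb_to_lab(*p) for p in pixels]
    n = len(labs)
    return tuple(sum(v[k] for v in labs) / n for k in range(3))


if __name__ == "__main__":
    # Hypothetical reddish seed pixels, for illustration only.
    print(mean_lab([(150, 40, 50), (160, 45, 55)]))
```

Averaging in L*a*b* space after per-pixel conversion is one possible design choice; the source does not state whether averaging was done before or after conversion.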
The procedure of preparing images to determine the morphological parameters was different. Figure 2 depicts a schematic view of this procedure for six seeds of a typical variety (VJG).
Because binary images are commonly used for extracting particle information, the acquired images were converted to binary format using ImageJ software, version 1.45e (National Institutes of Health, Bethesda, MD, USA).
Individual pomegranate seeds were identified by segmentation, which was accomplished using the Otsu algorithm in ImageJ. Otsu's threshold algorithm searches for the threshold that minimizes the intraclass variance, which is equivalent to maximizing the between-class variance, whereas the manual method assigns the threshold by locating the peaks of the gray-level frequency histogram and then the valleys between them (Gonzales-Barron & Butler, 2006; ImageJ online documentation).
In Otsu's algorithm, the optimal threshold value (t*), expressed in terms of class probability (ωi) and class mean (μi), can be obtained in a sequence of steps: (1) computing the probability of each gray intensity level (pi), (2) establishing the initial class probabilities (ωi) and means (μi), (3) stepping through all possible thresholds (t = 1…maximum intensity), and (4) updating ωi and μi to find the threshold (t*) that corresponds to the maximum between-class variance (Farrera-Rebollo et al., 2011).

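For a two-class split at threshold $t$, the between-class variance takes the standard form

$$\sigma_b^2(t) = \omega_1(t)\,\omega_2(t)\,\left[\mu_1(t) - \mu_2(t)\right]^2$$

where $\omega_1(t)$ and $\omega_2(t)$ are the probabilities of the two classes separated by $t$, and $\mu_1(t)$ and $\mu_2(t)$ are the corresponding class means.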
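The threshold search described for Otsu's algorithm can be sketched in a few lines. This is a minimal pure-Python illustration that operates on a gray-level histogram; it is not ImageJ's implementation, and the function name `otsu_threshold` and the convention that gray levels below t form the first class are assumptions made for the example.

```python
def otsu_threshold(hist):
    """Return the threshold t* that maximizes the between-class variance.

    hist -- list of pixel counts per gray level (index = gray level).
    Levels < t form class 1, levels >= t form class 2 (assumed convention).
    """
    total = sum(hist)
    if total == 0:
        raise ValueError("empty histogram")

    # Step 1: probability of each gray level.
    p = [h / total for h in hist]
    global_moment = sum(i * pi for i, pi in enumerate(p))

    # Step 2: initial class probability and first moment of class 1.
    w1, moment1 = 0.0, 0.0
    best_t, best_var = 0, -1.0

    # Step 3: step through all candidate thresholds.
    for t in range(1, len(p)):
        # Step 4: update class probabilities and means incrementally.
        w1 += p[t - 1]
        moment1 += (t - 1) * p[t - 1]
        w2 = 1.0 - w1
        if w1 <= 0.0 or w2 <= 0.0:
            continue  # one class is empty, variance undefined
        mu1 = moment1 / w1
        mu2 = (global_moment - moment1) / w2
        var_between = w1 * w2 * (mu1 - mu2) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


if __name__ == "__main__":
    # Hypothetical bimodal histogram: dark background and bright seeds.
    hist = [0] * 256
    hist[10], hist[200] = 100, 100
    print(otsu_threshold(hist))
```

When several thresholds give the same variance, this sketch keeps the first one; other implementations may break ties differently.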
The next step was to reduce the effect of noise and outliers with a median filter (radius = 2 pixels). Afterward, dilation, one of the two basic operators of mathematical morphology, was applied to the filtered images. The basic effect of this operator on a binary image is to gradually enlarge the boundaries of regions of foreground pixels.
Thus, areas of foreground pixels grow in size while holes within those regions become smaller.
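The effect of dilation on a binary image can be seen in a small sketch. It assumes a 3×3 square structuring element and represents the image as nested lists of 0 (background) and 1 (foreground); the function name `dilate` is hypothetical, and the structuring element actually used in the study is not stated.

```python
def dilate(img):
    """Binary dilation with an assumed 3x3 square structuring element.

    A pixel becomes foreground if it or any of its 8 neighbours is foreground.
    """
    rows, cols = len(img), len(img[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols and img[rr][cc]:
                        out[r][c] = 1
    return out


if __name__ == "__main__":
    # A ring of foreground pixels with a one-pixel hole in the middle.
    ring = [
        [0, 0, 0, 0, 0],
        [0, 1, 1, 1, 0],
        [0, 1, 0, 1, 0],
        [0, 1, 1, 1, 0],
        [0, 0, 0, 0, 0],
    ]
    for row in dilate(ring):
        print(row)  # the region grows and the hole is filled
```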
The enhanced images were then analyzed to obtain a detailed description of the overall morphology. For each individual pomegranate seed, the acquired size parameters were Area (in square pixels), Perimeter (the length of the outside boundary of the selection), and Minimum Feret Diameter, MFD (the minimum distance between parallel tangents), and, from the side view image, the Shape Descriptors Circularity (4π × area/perimeter²), Roundness (4 × area/(π × major axis²)), and Solidity (area/convex area) (Rasband, 2006). In this study, PCA was used to reduce the dimensionality of the data. The reduced feature spaces were then used for agglomerative hierarchical clustering (AHC). The analysis was performed with the XLSTAT 2011 statistical package.
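The three shape descriptors defined above follow directly from the measured size parameters. The sketch below simply evaluates those formulas; the function name `shape_descriptors` and its argument names are hypothetical, and the inputs are assumed to be in consistent pixel units.

```python
import math


def shape_descriptors(area, perimeter, major_axis, convex_area):
    """Evaluate Circularity, Roundness, and Solidity from size parameters."""
    if min(area, perimeter, major_axis, convex_area) <= 0:
        raise ValueError("all measurements must be positive")
    return {
        # 1.0 for a perfect circle, approaching 0 for elongated outlines.
        "circularity": 4.0 * math.pi * area / perimeter ** 2,
        # Compares the area with that of a circle spanning the major axis.
        "roundness": 4.0 * area / (math.pi * major_axis ** 2),
        # 1.0 for a convex outline, lower when the outline has concavities.
        "solidity": area / convex_area,
    }


if __name__ == "__main__":
    # A circle of radius 10 pixels should give 1.0 for all three descriptors.
    r = 10.0
    print(shape_descriptors(math.pi * r ** 2, 2 * math.pi * r, 2 * r, math.pi * r ** 2))
```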

| Clustering
Clustering methods can be divided into two basic types: hierarchical and partitional clustering. Within each of the types there exists a wealth of subtypes and various algorithms for finding the clusters. Hierarchical clustering proceeds successively by either merging smaller clusters into larger ones, or by splitting larger clusters.
Partitional clustering, on the other hand, attempts to directly decompose the data set into a set of disjoint clusters (Rokach & Maimon, 2005). The method proposed by Ward (1963) merges the two groups whose union increases the within-group inertia as little as possible, which keeps the clusters homogeneous. In this study, based on the nature of the data, this aggregation criterion, which has the least susceptibility to noise and outliers, was applied. Ward's distance ($D_w$) between clusters $C_i$ and $C_j$ is the difference between the within-cluster sum of squares resulting from merging the two clusters into cluster $C_{ij}$ and the total within-cluster sum of squares for the two clusters separately:

$$D_w(C_i, C_j) = \sum_{x \in C_{ij}} \lVert x - r_{ij} \rVert^2 - \left( \sum_{x \in C_i} \lVert x - r_i \rVert^2 + \sum_{x \in C_j} \lVert x - r_j \rVert^2 \right)$$

where $r_i$, $r_j$, and $r_{ij}$ are the centroids of $C_i$, $C_j$, and $C_{ij}$, respectively.
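Ward's distance can be computed directly from its definition as the increase in the within-cluster sum of squares caused by a merge. The following is a minimal sketch in which clusters are lists of equal-length coordinate tuples; the function names are hypothetical, and this is not the XLSTAT implementation.

```python
def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(len(points[0])))


def within_ss(points):
    """Sum of squared Euclidean distances from each point to the centroid."""
    c = centroid(points)
    return sum(sum((pk - ck) ** 2 for pk, ck in zip(p, c)) for p in points)


def ward_distance(ci, cj):
    """Increase in within-cluster sum of squares when ci and cj are merged."""
    return within_ss(ci + cj) - (within_ss(ci) + within_ss(cj))


if __name__ == "__main__":
    # Two hypothetical clusters in a two-dimensional PC space.
    a = [(0.0, 0.0), (0.0, 2.0)]
    b = [(4.0, 0.0), (4.0, 2.0)]
    print(ward_distance(a, b))
```

At each step, agglomerative clustering with Ward's criterion merges the pair of clusters for which this value is smallest.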

| PCA outcomes
To achieve satisfactory results in a multivariate statistical analysis, the variables should be selected carefully so that only relevant variables are included in the analysis. The results of the PCA for image features and physicochemical properties are presented in Table 1. The analysis demonstrates that 40.09% of the total

| Principal components loading
Principal components loading (eigenvectors) and correlations between variables and PCs of image features and physicochemical properties are shown in Tables 2 and 3, respectively. Also, six PC scores were calculated as linear combinations of measured physicochemical properties.
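Computing PC scores as linear combinations of the measured variables can be sketched as follows. The sketch assumes that the variables are standardized (zero mean, unit sample standard deviation) before being weighted by the eigenvectors, which corresponds to a PCA on the correlation matrix; whether the study used this normalization is an assumption, and the function names are hypothetical.

```python
from statistics import mean, stdev


def standardize(data):
    """Center each column and scale it by its sample standard deviation."""
    cols = list(zip(*data))
    means = [mean(c) for c in cols]
    sds = [stdev(c) for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in data]


def pc_scores(data, eigenvectors):
    """Return one score per observation and per PC.

    data         -- rows are observations (cultivars), columns are variables
    eigenvectors -- one loading vector per retained PC
    """
    z = standardize(data)
    return [[sum(zv * w for zv, w in zip(row, ev)) for ev in eigenvectors] for row in z]


if __name__ == "__main__":
    # Hypothetical measurements of two variables on three cultivars.
    data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
    w = 0.5 ** 0.5
    print(pc_scores(data, [[w, w]]))
```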

| PC indicators
TABLE 2 Eigenvectors (EV) and correlations (R) between variables and PCs of image features

An alternative to PC scores is to select the measured variable most strongly correlated with each PC to represent that PC (PC indicator). This is computationally attractive, as there is no need to extract all the variables; only the selected variables need to be extracted (Chandraratne et al., 2006). The four image features selected as PC indicators are Circularity, Minimum Feret Diameter, L*, and a*.
Meanwhile, the six physicochemical properties selected as PC indicators are fruit diameter, % aril/fruit, juice density, total sugars, total phenolics, and pH.
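Selecting PC indicators amounts to picking, for each PC, the variable with the largest absolute correlation. The sketch below assumes the correlations are available as a mapping from variable name to a list of correlations, one per PC; the function name, the variable names, and the numbers are hypothetical placeholders, not values from Tables 2 and 3.

```python
def pc_indicators(correlations):
    """Return, for each PC, the variable most strongly correlated with it.

    correlations -- dict mapping variable name to a list of correlations,
                    one entry per PC (all lists must have the same length)
    """
    n_pcs = len(next(iter(correlations.values())))
    return [
        max(correlations, key=lambda var: abs(correlations[var][k]))
        for k in range(n_pcs)
    ]


if __name__ == "__main__":
    # Hypothetical correlations of three variables with two PCs.
    example = {
        "var_a": [0.91, 0.10],
        "var_b": [-0.95, 0.20],
        "var_c": [0.30, -0.88],
    }
    print(pc_indicators(example))
```

The absolute value is used so that a strong negative correlation counts as much as a strong positive one. This simple rule can pick the same variable for two PCs, in which case a tie-breaking rule would be needed.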

| Clustering results and overlapping of them
All variables, PC scores, and PC indicators were used for clustering.
The results of clustering based on the different variable sets, and the cultivars assigned to each cluster, are shown in Table 4.
The largest cluster contains 11 cultivars, and each cluster contains at least four cultivars. In order to evaluate how much image-
The two dendrograms (Figures 3 and 4) show how the AHC algorithm progressively grouped the different pomegranate seed cultivars based on the PC indicators of their physicochemical properties (Figure 3) and on all variables of the image features (Figure 4).
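One plausible way to quantify the overlap between a physicochemical cluster and the matching image-based cluster is the percentage of cultivars in the former that also appear in the latter. The sketch below assumes that definition; the source does not state the exact denominator used, and the function name and cultivar labels are hypothetical.

```python
def overlap_percent(reference_cluster, candidate_cluster):
    """Percentage of reference cultivars that also occur in the candidate cluster."""
    reference = set(reference_cluster)
    if not reference:
        raise ValueError("reference cluster must not be empty")
    shared = reference & set(candidate_cluster)
    return 100.0 * len(shared) / len(reference)


if __name__ == "__main__":
    # Hypothetical cultivar labels, for illustration only.
    physicochemical = ["cv1", "cv2", "cv3"]
    image_based = ["cv1", "cv2", "cv7"]
    print(round(overlap_percent(physicochemical, image_based), 2))
```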

| CONCLUSIONS
In this work, nine image features and 21 physicochemical properties were extracted in order to cluster the seeds of 20 pomegranate cultivars.