Smart agriculture: real-time classification of green coffee beans by using a convolutional neural network

Coffee is an important economic crop and one of the most popular beverages worldwide. The rise of speciality coffees has changed people's standards regarding coffee quality. However, green coffee beans are often mixed with impurities and defective beans. Therefore, this study aimed to solve the problem of the time-consuming and labour-intensive manual selection of coffee beans for speciality coffee products. The second objective of the authors' study was to develop an automatic coffee bean picking system. They first used image processing and data augmentation technologies to prepare the data. They then used a deep convolutional neural network to analyse the image information. Finally, they connected the trained model to an IP camera for recognition and successfully separated good and bad beans. The false-positive rate was 0.1007, and the overall coffee bean recognition rate was 93%.


Introduction
Coffee is an important economic crop and one of the most popular beverages in human society [1]. Coffee is cultivated in over 70 countries, primarily in the equatorial regions of America, Southeast Asia, the Indian subcontinent, and Africa. Green coffee beans are among the most traded agricultural products in the world. With the rapid increase in speciality coffee retailers and cafes in the 1990s, speciality coffee became one of the fastest-growing markets in the foodservice industry. Many countries have developed their own speciality coffee associations. According to the Specialty Coffee Association of America (SCAA), speciality coffee is defined not simply by the brewed cup delivered to consumers but by the entire process of producing that cup. However, green coffee beans are often mixed with impurities and defective beans [2]. If these impurities and defective beans are not manually picked out before roasting, the overall coffee quality and flavour are affected.
At present, artificial intelligence applications can be divided into three categories: speech recognition [3], image recognition, and natural language processing [4]. Image recognition technology can be applied in areas such as smart cities, medical care, and agriculture. In smart cities [5], traffic flow analysis can be used to relieve traffic congestion and reduce traffic accidents. Medical imaging [6] through MRI or computed tomography can be used in the early detection and treatment of disease. Agricultural imaging [7] can be used to identify crop pests, which reduces crop losses during plantation and increases crop yields. Therefore, it is hoped that image recognition technology can be applied to identifying green coffee beans, thereby improving the quality and flavour of coffee.
The definition of defective beans [8] provided by the SCAA is presented in Tables 1 and 2. There are two types of defects: primary defects and secondary defects. Through the definitions provided by the SCAA, we can determine which green coffee beans are defective. If bad beans are roasted together with good beans, the coffee does not taste like speciality coffee. To solve this problem, green beans must be selected manually. However, this process involves considerable labour and time costs.
Many methods have already been proposed for classifying different species of coffee beans, including artificial neural networks (ANNs), K-nearest neighbours (KNN) [9], support vector machines (SVMs) [10], and near-infrared (NIR) spectroscopy [11]. Most of these methods achieve >90% accuracy in classifying coffee bean species. However, few methods have been proposed for detecting defects in unroasted green coffee beans. Before the advent of deep learning, the area and circumference [12] of a coffee bean were considered important criteria for judgement, yielding approximately 78.5% accuracy in inspecting defective coffee beans. Image processing techniques with thresholding [13] can also help detect defects in green coffee beans, with approximately 83% accuracy.
Currently, many vision sorting systems [14] are available on the market. These systems can distinguish good and bad varieties of items such as peanuts, seeds, rice, and green coffee beans, mainly on the basis of colour. However, a colour range must be set for these systems, and a robotic arm is required to pick out the items. This method is slow and inefficient. Accordingly, we propose a method that uses deep learning technology to determine the standard of good and bad items.
We preprocessed images of green coffee beans by using image processing technology and then analysed them with a convolutional neural network (CNN), a popular deep learning technology. CNNs excel at extracting colour and shape features from images. Therefore, we can easily capture the features of good and bad beans, such as partial blackness and breakage. We trained a dedicated coffee bean prediction model to quickly distinguish good raw beans from bad ones. By using this method, the considerable time required for the manual selection of coffee beans can be reduced and the development of speciality coffee beans can be promoted. An automatic green coffee bean identification system was developed in this study to enhance the speed and accuracy of green coffee bean identification through artificial intelligence. The experimental results indicate that the developed tool can serve as a reference for related studies.

System architecture and implementation
We established the process flow displayed in Fig. 1. We first collected the data, preprocessed them, augmented them, and finally resized the images into a data format acceptable to the neural network. This entire process automatically generated the required data sets. The central server specification and each step in the process flow are explained in the following sections.

Central server
(i) Operating system: In our experiment, we used Ubuntu 14.04 as the operating system for our central server.
(ii) Hardware: In recent years, central processing unit (CPU) performance and graphics processing unit (GPU) technology have improved tremendously. The training time of the learning module can be greatly reduced through GPU parallel computing technology. We used a GPU developed by Nvidia to accelerate training in parallel. The hardware list of our central server is shown in Table 3. Since all the coffee bean images must be read and normalised before the GPU training process, we needed larger memory to accommodate enough images and a faster CPU.

Data collection
Since insufficient pictures of coffee beans were available on the Internet, we bought green coffee beans from a coffee shop. Fig. 2 shows a few of the coffee beans we bought from the shop.
The shop helped us divide the green coffee beans into good and bad beans so that we could photograph them directly. In the past, collecting data sets was a difficult part of building a neural network, sometimes taking more time than the training itself. We therefore designed an automatic coffee bean data collection mechanism with a conveyor. After coffee beans are dumped into a vibrating bucket, the bucket lines the pile up in a single lane by vibration and feeds the beans onto the conveyor consecutively. The vibrating bucket is shown in Fig. 3.
A high-resolution IP camera is set above the conveyor to take a photo while coffee beans pass through, and it is connected to a desktop computer. We fine-tuned a pre-trained YOLOv3 object detection model on coffee beans. When an object passes under the camera, the model recognises it and commands the camera to take a photo. With this mechanism, we can collect data sets automatically instead of taking photos manually. The environment is shown in Fig. 4.
We previously used a webcam to take the photos. However, we found that when coffee beans moved beyond a certain speed on the conveyor, motion blur appeared in the photos, which led to a severe accuracy drop in the system. Therefore, we switched from an ordinary web camera to a high-resolution IP camera, with which we can manually adjust the exposure time, white balance, and other camera parameters. This camera is connected to the computing unit through an RJ45 Ethernet port. Unlike the original web camera, the new IP camera supports exposure times as short as 1/10,000 s. With a low exposure time and a sufficient light source, we managed to photograph objects moving at high speed on the conveyor without motion blur. The values of the camera parameters are presented in Table 4. The captured images are then cropped and resized to a reasonable resolution to become valid data for training. The images shown below are all in relatively low resolution because training the model with high-resolution photos would require a more powerful GPU and much more time. We therefore resized the photos of the coffee beans while retaining most of the features of their appearance.
We photographed 1000 good beans and 1000 bad beans, as displayed in Fig. 5.

Data preprocessing
(i) Image segmentation: To reduce the burden and time spent cutting images manually, we cropped the coffee beans from the raw data through program automation: we converted the raw data into greyscale, applied threshold values, and finally obtained the precise position of the green coffee beans [15]. (ii) Image background removal: We endeavoured to reduce the interference of the background with the training model. Although the background seemed black, its RGB values still ranged between 0 and 255 rather than being absolutely black, so we still needed to remove the background from the green coffee bean images. We used colour detection methods to remove the background [16].
(a) Colour detection: The advantage of colour detection is that a mask can be generated from two bounds, an upper bound and a lower bound. By using the mask, pixel values within that range can be removed from the image. For this reason, we used an ultra-black material as our background, which could then be easily removed from the image. Fig. 6 illustrates the bean images obtained after the background was removed. The left image displays a good bean, whereas the right image displays a bad bean.
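The colour-detection masking described above can be sketched as follows. This is a minimal pure-Python stand-in, not the authors' implementation (which would typically use a library call such as OpenCV's cv2.inRange): any pixel whose channels all fall between the lower and upper background bounds is replaced with pure black, and all other pixels are kept.

```python
# Illustrative sketch of bound-based background removal. The bounds
# (0,0,0)-(30,30,30) are assumed values for a near-black background,
# not the parameters used in the paper.

def background_mask(pixel, lower, upper):
    """True if every channel of the pixel lies inside the background range."""
    return all(lo <= c <= hi for c, lo, hi in zip(pixel, lower, upper))

def remove_background(image, lower=(0, 0, 0), upper=(30, 30, 30)):
    """Replace near-black background pixels with pure black (0, 0, 0)."""
    return [[(0, 0, 0) if background_mask(p, lower, upper) else p for p in row]
            for row in image]

row = [(10, 12, 8), (190, 170, 140)]   # a dim background pixel, a bean pixel
print(remove_background([row]))
# → [[(0, 0, 0), (190, 170, 140)]]
```

An ultra-black backdrop makes this reliable because the bean pixels never fall inside such a narrow dark range.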

Data augmentation
At the beginning of the data collection process, the data on good and bad beans were insufficient. Therefore, the number of good and bad beans was set at 1000 each. However, insufficient training data could cause the CNN model to overfit. Data augmentation [17] technology provides methods for flipping, offsetting, cropping, enlarging, and shrinking images to enhance the original data set. Because we did not wish to change the size, shape, or colour of the original images, we used only rotation and flip operations to enhance the collected data set.
As depicted in Fig. 7, we enhanced the data set by rotating each coffee bean about its centre in 40° steps and collecting the flipped images. We originally rotated the images in 30° steps. However, the experiment indicated that a 180° rotation duplicated images already produced by flipping. Finally, as presented in Table 4, we obtained a data set nine times larger than the original by rotating the images of the 1000 good and 1000 bad beans, and a further four times larger through flipping. In total, the augmented data set was 36 times larger than the original and served as our training and testing data.
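The augmentation factor above follows from simple counting: 40° steps give 9 distinct orientations (0° through 320°), and the 4 flip variants (none, horizontal, vertical, both) multiply that by 4. The snippet below just reproduces this arithmetic from the text.

```python
# Counting the rotation/flip variants behind the 36x augmentation factor.
angles = list(range(0, 360, 40))                 # 0, 40, ..., 320 → 9 rotations
flips = ["none", "horizontal", "vertical", "both"]
variants = [(a, f) for a in angles for f in flips]

print(len(angles))                # 9 rotations per bean
print(len(variants))              # 36 images per bean
print(2 * 1000 * len(variants))   # 72000 images from 1000 good + 1000 bad beans
```

This also shows why 30° steps were abandoned: with 12 rotations, the 180° rotation coincides with the both-axes flip, producing duplicates.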

Image resizing
After the rotation and flip augmentation, we resized the data set to images of fixed length and width and used these images as the training material for the CNN model. In data preprocessing, we resized each image to 180 pixels in width and length. The details of the method are as follows: (i) Resizing: The maximum of the current image's length and width was computed, and a square with a black background of that size was created. The original image was placed on the black background and the result was finally resized to 180 × 180 pixels.
We used a method that requires knowing only the length and width of the current picture. The dimensions of all coffee beans cannot be obtained in advance when video is streamed; therefore, this resizing method was suitable for our video stream preprocessing.
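The padding step described above can be sketched as follows. This pure-Python version only embeds the cropped bean in a square black canvas; the final interpolation down to 180 × 180 pixels is omitted, as it would normally be delegated to a library such as OpenCV or Pillow.

```python
# Illustrative sketch: centre a cropped bean on a square black canvas so the
# aspect ratio is preserved before scaling. `pad_to_square` is a hypothetical
# helper name, not from the paper.

def pad_to_square(image, fill=0):
    """Centre a 2-D pixel grid on a square canvas filled with `fill`."""
    h, w = len(image), len(image[0])
    side = max(h, w)
    canvas = [[fill] * side for _ in range(side)]
    top, left = (side - h) // 2, (side - w) // 2
    for y, row in enumerate(image):
        for x, v in enumerate(row):
            canvas[top + y][left + x] = v
    return canvas

bean = [[5, 6, 7]]                    # a 1 x 3 toy "image"
square = pad_to_square(bean)
print(len(square), len(square[0]))    # 3 3
print(square[1])                      # [5, 6, 7]
```

Because only the current image's height and width are needed, this works frame by frame on a live video stream.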

CNN model architecture
In this implementation, we used CNN models to train on greyscale coffee bean images (Table 5). By compressing the image's third dimension (the colour channels) into one, we obtain a greyscale image. The greyscale image enabled easy detection of the shape of the green coffee beans and of their dark colours.
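The channel compression mentioned above can be sketched as follows. A plain channel average is used here for illustration; the paper does not specify its conversion, and libraries such as OpenCV apply weighted luminance coefficients instead.

```python
# Illustrative sketch: collapse a 3-channel RGB pixel grid into a single
# greyscale channel by averaging. `rgb_to_greyscale` is a hypothetical
# helper name, not the authors' function.

def rgb_to_greyscale(image):
    """Average the three colour channels of each pixel into one value."""
    return [[sum(pixel) // 3 for pixel in row] for row in image]

image = [[(90, 60, 30), (0, 0, 0)]]   # one bean-coloured pixel, one black pixel
print(rgb_to_greyscale(image))        # [[60, 0]]
```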
In the CNN model architecture, we mainly used the rectified linear unit (ReLU) activation function for the convolution layers. The ReLU applies the non-saturating activation function f(x) = max(0, x) [18] (1). It effectively removes negative values from an activation map by setting them to zero, and it increases the non-linear properties of the decision function and of the overall network without affecting the receptive fields of the convolution layer. Other functions, such as the saturating hyperbolic tangent f(x) = tanh(x) (2) and the sigmoid function σ(x) = (1 + e^(−x))^(−1) [19] (3), are also used to increase non-linearity. The ReLU is often preferred because it trains the neural network several times faster than the other functions without a significant decrease in generalisation accuracy.

Recognition mechanism
Streaming at high resolution and frame rate is a heavy workload for the recognition model. As a result, we applied a fine-tuned YOLOv3 object detection model to our system. When an object on the conveyor passes under the camera, the object detection model commands the camera to take a photo. The coffee bean classification model then judges the bean from this photo and instructs the air gun to act if the bean is judged to be bad.
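The decision flow just described can be sketched as a small control loop. The component names here (detect_bean, take_photo, classify, fire_air_gun) are hypothetical stand-ins for the detector trigger, camera, classifier, and air-gun interface, not the authors' APIs; only the detect → photograph → classify → eject logic is taken from the text.

```python
# Illustrative control-flow sketch of the recognition mechanism, with
# hypothetical component names injected as callables for testability.

def process_frame(frame, detect_bean, take_photo, classify, fire_air_gun):
    """Run one frame through the detect -> photograph -> classify -> eject loop."""
    if not detect_bean(frame):
        return None                   # nothing under the camera yet
    photo = take_photo()
    label = classify(photo)           # "good" or "bad"
    if label == "bad":
        fire_air_gun()                # eject only beans judged to be bad
    return label

# Toy run with stubs standing in for the real hardware and models.
ejected = []
label = process_frame(
    frame="frame-with-bean",
    detect_bean=lambda f: True,
    take_photo=lambda: "photo",
    classify=lambda p: "bad",
    fire_air_gun=lambda: ejected.append("fired"),
)
print(label, ejected)   # bad ['fired']
```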

Experiments and results
In the experiment, we prepared 72,000 images through data augmentation: 36,000 images of good beans and 36,000 images of bad beans. In total, 7000 images each of good and bad beans were selected randomly from the augmented data as the testing data, and the remaining data were used as the training data. The numbers of training and testing data are shown in Table 6.
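The random split above can be sketched with Python's standard random module. The seed is an arbitrary choice for reproducibility of the sketch; only the counts (36,000 per class, 7000 per class held out) come from the text.

```python
# Illustrative train/test split: sample 7000 test indices per class from
# 36,000 augmented images per class; the rest become training data.
import random

random.seed(0)                     # arbitrary seed, for reproducibility only
per_class = 36_000
test_per_class = 7_000

indices = list(range(per_class))
test_idx = set(random.sample(indices, test_per_class))
train_idx = [i for i in indices if i not in test_idx]

print(len(test_idx))                          # 7000 test images per class
print(len(train_idx))                         # 29000 training images per class
print(2 * (len(test_idx) + len(train_idx)))   # 72000 images in total
```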

Google object detection Application Programming Interface (API)
The Google object detection API provides many training models, such as Faster R-CNN, the single shot MultiBox detector (SSD), and R-FCN.
Since Faster R-CNN has higher accuracy than the other models, we chose it as our model. We used the Faster R-CNN with Inception v2 model pre-trained on COCO, as provided in the API; it offers a fair trade-off between accuracy and speed in GPU-accelerated environments. Before training the model, we needed to prepare the data set and labels by first converting the images and labels into the TFRecord format (Fig. 8). TFRecord is a binary file format recommended by Google that can hold information in any format, and TensorFlow provides a rich API for reading and writing TFRecord files. After preparing the data set and labels, we trained the model and applied it to identification. Fig. 9 shows the result of identification with the trained model: the picture is full of various boxes. With this identification, the correctness and quality of the coffee beans cannot be determined accurately. We read many papers on the Internet that used this method, and many people encountered the same problem. They suggested increasing the number of training steps. However, the problem remained after we tried this suggestion.
Finally, we found that most object detection applications target large objects, such as humans, cars, and animals, which have obvious feature differences. By contrast, green coffee beans must be observed very closely to pick out their details. Therefore, although the Google object detection API is convenient and easy to use, coffee bean identification is difficult to accomplish with it.

Object detection based on the CNN
In the designed CNN model architecture, the preprocessed data were used as the training data. Originally, we performed data preprocessing in the RGB colour space before training. However, when the coffee beans were of different colours, correctly identifying the good and bad beans became difficult. Therefore, we preprocessed the data as greyscale images, in which the shape of a bean can be detected easily. The training results are illustrated in Figs. 10 and 11. Fig. 10 displays the line chart of the training and testing accuracies over ten epochs; both increased steadily, and the final testing accuracy was ∼94.68%. Fig. 11 displays the line chart of the training and testing losses; both gradually decreased, and in the late stages of training, the training loss was smaller than the testing loss. The final testing loss was ∼0.14. The confusion matrix of the training result in Fig. 10 is presented in Table 7. In the testing data, 435 of the 7000 good beans were misidentified as bad beans and 309 of the 7000 bad beans were misidentified as good beans. The fewer the bad beans, the better the quality of the speciality coffee. The measure of interest in this study was the false positive rate (FPR) [20], defined as FPR = FP/(FP + TN) (4), where false positives (FP) are bad beans predicted as good beans and true negatives (TN) are bad beans predicted as bad beans. This index indicates the proportion of bad beans misjudged as good beans: the lower the FPR, the closer we were to the standard of speciality coffee. According to the results of our experiments, the FPR was 0.0441 when the testing data included 7000 images each of good and bad coffee beans.
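The confusion-matrix arithmetic above can be checked directly. Taking "good bean" as the positive class, the 435 misread good beans are false negatives and the 309 misread bad beans are false positives, which reproduces both the reported FPR and the ∼94.68% testing accuracy.

```python
# Reproducing the reported metrics from the confusion-matrix counts in Table 7.
fn = 435                      # good beans predicted as bad
fp = 309                      # bad beans predicted as good
tp = 7000 - fn                # good beans predicted as good
tn = 7000 - fp                # bad beans predicted as bad

fpr = fp / (fp + tn)          # Eq. (4)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(fpr, 4))          # 0.0441
print(accuracy)               # ≈ 0.947, matching the ~94.68% testing accuracy
```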
Finally, we set up an IP camera with the same parameters as those used in data collection and connected it to our trained model. We detected each frame of the video stream and cropped the coffee bean images by using image segmentation. Then, through the data preprocessing method described in Section 2, the captured bean images were cropped and imported into our trained model for identification. Fig. 12 displays a screenshot of the output at the time of recognition: the green frames were predicted as good beans, whereas the red frames were predicted as bad beans. Users can thus sort coffee beans accurately and efficiently through instant image identification. In our current test, we could identify a green coffee bean in 1 ms when the video stream reached 24 frames per second [21], and the quality of the green coffee beans could be judged smoothly and accurately.
The previous training ran for only ten epochs. To ensure that the training had reached its bottleneck and to improve the overall accuracy, we added five epochs to the original training. The comparison of the results is shown in Table 8. The numbers of true positives and false positives decreased, whereas the numbers of false negatives and true negatives increased. A lower false positive rate indicates fewer misjudgements: we reduced the chance of misjudging bad coffee beans and improved the overall identification accuracy.

Conclusions and future work
In this study, we used greyscale images for training. Therefore, in the identification process, we also converted the images into greyscale. However, greyscale images are vulnerable to ambient light because only one-dimensional intensity information is retained.
The CNN model was used to classify good and bad beans in our experiment. The overall coffee bean identification accuracy was ∼94.63% and the FPR was 0.0441. By connecting the coffee bean identification model to an IP camera, we could instantly distinguish good beans from bad ones without manual selection by the human eye. By using object detection and image recognition technology, we could reduce the time and labour costs involved and help develop the speciality coffee industry. The prototype of our system is shown in Fig. 13.
The system currently handles only one side of each coffee bean; a defect on the hidden back side cannot be screened out in a single pass. As a workaround, beans labelled as good are sent through the inspection process repeatedly, so that most defects on the back side are screened out after a few passes. Moreover, we will train a new inspection model that can detect defects on both sides of the coffee beans; to achieve this, we need to collect more data containing the back sides of the beans. After developing this solution, the system will be able to screen out defects on both sides, with each side of a bean inspected respectively. The model currently runs on a desktop personal computer. However, considering cost and convenience, we plan to move the whole process to an edge computing device, such as an NVIDIA Jetson Nano or a Raspberry Pi 4 with an Intel Neural Compute Stick. After replacing the computing unit, the device will be easier to transport and less costly for users.
Artificial intelligence is feasible for the image recognition of green coffee beans, and it can provide accurate and efficient results. Furthermore, good and bad beans can be accurately distinguished by using a camera, which solves the problem of spending considerable time and effort on selection. In the future, we hope to connect a robotic machine to select and remove bad beans. The blueprint of the architecture of our system is illustrated in Fig. 14.
In Fig. 14, we combine our system architecture with that of a colour sorter machine. The background is placed on the track, and the beans are instantly identified through the original webcam; the brown objects are good coffee beans, and the yellow ones are bad coffee beans. We replaced the original CCD or infrared technology with a deep learning model while retaining the colour sorter's selection method: an air gun then separates the coffee beans, and they finally fall into two separate containers.

Acknowledgment
This study was supported by the Ministry of Science and Technology (MOST) of Taiwan under grant MOST 107-2218-E-007-004.