Lemon-YOLO: An efficient object detection method for lemons in the natural environment

Efficient intelligent detection is a key technology for automatic harvesting robots. However, citrus detection is still a challenging task because of varying illumination, random occlusion and the colour similarity between fruits and leaves in natural conditions. In this paper, a detection method called Lemon-YOLO (L-YOLO) is proposed to improve the accuracy and real-time performance of lemon detection in the natural environment. The SE_ResGNet34 network is designed to replace the DarkNet53 network in the YOLOv3 algorithm as a new feature-extraction backbone. It enhances the propagation of features and needs fewer parameters, which helps to achieve higher accuracy and speed. Moreover, the SE_ResNet module is added to the detection block to improve the quality of the representations produced by the network by strengthening the convolutional features of channels. The experimental results show that the proposed L-YOLO achieves an average precision (AP) of 96.28% and a detection speed of 106 frames per second (FPS) on the lemon test set, which are 5.68% and 28 FPS higher than those of YOLOv3, respectively. The results indicate that the L-YOLO method has superior detection performance. It can recognize and locate lemons in the natural environment more efficiently, providing technical support for robotic picking of lemons and other fruits.


INTRODUCTION
Picking fruits and vegetables is a crucial part of agricultural production. However, most of this work still relies on manual labour, which consumes considerable manpower and time and results in high production costs [1]. To achieve auto-picking by robots, detecting objects precisely is the primary issue. Compared with manual lemon picking, studying lemon detection in the natural environment by machine vision has great application value and practical significance for promoting robotic fruit picking. Fruit detection is the basis of automatic picking. Many researchers have proposed detection algorithms based on traditional image recognition methods, which have fulfilled the task of recognizing fruits in the natural environment. Kurtulmus et al. [2] used colour, circular Gabor texture analysis and a novel 'eigenfruit' approach to detect green citrus fruits; 75.3% of the actual fruits in the validation set were successfully detected by this method. Xu et al. [3] proposed a new method for strawberries based on a histogram of oriented gradients (HOG) descriptor combined with a support vector machine (SVM) classifier. The accuracy achieved was 87%, but the proposed classifier could only detect slightly overlapping strawberries appropriately. Zhao et al. [4] put forward a multiple-feature image fusion method to recognize tomatoes. The wavelet transformation was applied to fuse the feature information of the a*-component and I-component extracted from different colour spaces and luminance, and the optimal threshold obtained by an adaptive threshold algorithm was used to distinguish the tomatoes from the background. This method obtained an accuracy of 93% without considering the effect of illumination. Okamoto et al. [5] proposed a pixel discrimination function to distinguish fruits from other objects in a hyperspectral image and applied spatial image processing steps to the segmented images to detect green citrus fruits; 80-89% of the fruits were recognized correctly, but numerous highly overlapped fruits were identified incorrectly.
The traditional fruit recognition and detection algorithms mentioned above are mainly based on handcrafted features, such as colour, texture and shape. These features can easily be changed by typical interfering factors in natural conditions, such as fluctuating illumination, uneven brightness, overlapping fruits and occlusion by branches and leaves, resulting in lower detection precision and poorer generalization. Therefore, they cannot meet the needs of efficient robotic fruit picking.
With the rapid development of deep learning technology, various mainstream object detectors have been proposed, falling into two categories: one-stage and two-stage object detectors. Representative one-stage object detectors, including SSD [6], RetinaNet [7], YOLOv3 [8], YOLOv4 [9] and EfficientDet [10], have been developed continuously. Among two-stage object detectors, well-known models proposed successively include Fast R-CNN [11], Faster R-CNN [12], FPN [13], Cascade R-CNN [14] and Libra R-CNN [15]. Moreover, anchor-free one-stage detectors, such as CornerNet [16], CenterNet [17] and FCOS [18], have received attention in recent years. Some of these state-of-the-art object detectors have recently been applied to fruit detection. They extract features from images with convolutional neural networks (CNNs) to recognize and locate objects, and achieve excellent detection precision or speed, overcoming some defects of traditional methods. Xiong et al. [19] used the Faster R-CNN model to detect green citrus on trees; the mean average precision (mAP) on the test dataset was 85.49%, and the average computing time for a single image was 0.4 s. Bi et al. [20] designed a visual feature recognition model based on a deep convolutional neural network (DCNN) to detect citrus in the natural environment. A DCNN structure with deep pooling was used to extract high-level semantic features of citrus and obtain citrus feature maps; the AP and average detection time of the model were 86.6% and 0.08 s, respectively. Peng et al. [21] adopted an improved SSD model to detect apples, litchis, navel oranges and Huangdi gan in different environments; the mAP reached 89.53% and the average detection time was about 0.05 s when the overlapping area was lower than 50%. Tian et al. [22] proposed the YOLOv3-dense model based on YOLOv3 to detect apples at different growth stages in orchards, using DenseNet to process the low-resolution feature layers in the YOLOv3 network; the F1 score and average detection time were 0.817 and 0.304 s per frame at a 3000 × 3000 resolution. Xiong et al. [23] modified the YOLOv3 model by drawing on ResNet and DenseNet, and applied the resulting Dens-YOLOv3 model to detect mature citrus in the complex field environment at night, achieving an mAP of 90.75% and costing only 0.019 s on average to detect an image with a resolution of 1920 × 2080 pixels, a detection speed of 53 FPS. However, the detection accuracy or speed of these deep-learning-based fruit detection methods still requires improvement for field picking robots with high working efficiency.
Considering that YOLOv3 [8] has been widely applied in practice for its good balance between accuracy and speed, the L-YOLO algorithm, built on YOLOv3, is proposed in this study. Unlike YOLOv4 [9], which applied various state-of-the-art bag-of-freebies and bag-of-specials methods to YOLOv3 to improve detection performance, L-YOLO adopts fewer convolutional layers and strategies to recognize and locate lemons on trees in the natural environment faster and more accurately, enabling efficient robotic lemon picking. The main contributions of this study are as follows: (i) the Squeeze-and-Excitation Network (SENet) [24] and the ResNeXt block [25] are used to construct a new network, termed SE_ResGNet34, which replaces the DarkNet53 network in YOLOv3 to extract richer lemon features from images; (ii) the first four convolutional layers using standard 1 × 1 and 3 × 3 convolutional filters in the YOLOv3 detection block are modified into SE_ResNet modules to enhance the convolutional features of channels. These improvements help the model achieve high detection accuracy and fast real-time detection speed.
The remainder of this paper is organized as follows: Section 2 describes the lemon dataset and presents the L-YOLO method in detail. Section 3 describes the experimental conditions and evaluation metrics, and presents and discusses the experimental results obtained with different methods and datasets. Section 4 provides the conclusions drawn from the results.

Image acquisition, augmentation and annotation
In this study, the lemon images are collected during different periods of sunny and cloudy weather. The orchards are located in Guigang, Guangxi, China. During image acquisition, the camera is 0.5-1.5 m away from the lemon fruits in different viewing directions. 800 images of lemons are collected from the orchards, all taken under natural conditions, including long range, close range, front-lighting, backlighting, random occlusion and so on. These 800 images are then expanded to 3850 images using image processing methods such as horizontal mirror flipping, random brightness adjustment, random small clipping and fixed scaling, to avoid over-fitting of the neural network due to the small number of samples. Figure 1 shows some samples from the lemon dataset under different conditions.
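As a concrete illustration of this pipeline, the sketch below applies the four listed augmentations with OpenCV and NumPy. The exact parameter ranges are not reported in this study, so the flip probability, brightness range, crop ratio and output size are illustrative assumptions; in a full pipeline the bounding-box annotations would have to be transformed together with the image.

```python
import random

import cv2
import numpy as np

def augment(image):
    """Apply the augmentations described in the text; all ranges are assumptions."""
    # Horizontal mirror flipping (assumed probability 0.5).
    if random.random() < 0.5:
        image = cv2.flip(image, 1)
    # Random brightness adjustment (assumed factor range 0.7-1.3).
    factor = random.uniform(0.7, 1.3)
    image = np.clip(image.astype(np.float32) * factor, 0, 255).astype(np.uint8)
    # Random small clipping: crop up to an assumed 10% from each border.
    h, w = image.shape[:2]
    top, left = random.randint(0, h // 10), random.randint(0, w // 10)
    bottom, right = h - random.randint(0, h // 10), w - random.randint(0, w // 10)
    image = image[top:bottom, left:right]
    # Fixed scaling to a common size (704 x 704, matching the test input size).
    return cv2.resize(image, (704, 704))
```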
Object detection models with supervised learning usually require that the input dataset contains the object images, the object category labels and their location coordinates. Therefore, it is necessary to manually draw bounding boxes in the lemon images to locate the lemon fruits and label their categories. In this study, the LabelImg tool is used to label the lemon images. It saves the image resolution, the pixel coordinates of the upper-left and lower-right corners of each bounding box, and the fruit category to XML files. Figure 2 shows some examples of annotated lemon images.
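LabelImg stores these annotations in the Pascal VOC XML format, so they can be read back with the Python standard library alone. A minimal parsing sketch (field names follow the VOC convention):

```python
import xml.etree.ElementTree as ET

def parse_annotation(path):
    """Read one LabelImg (Pascal VOC style) XML annotation file."""
    root = ET.parse(path).getroot()
    size = root.find("size")
    width = int(size.find("width").text)     # image resolution
    height = int(size.find("height").text)
    boxes = []
    for obj in root.iter("object"):
        name = obj.find("name").text         # category label, e.g. "lemon"
        box = obj.find("bndbox")
        boxes.append((name,
                      int(box.find("xmin").text),   # upper-left corner
                      int(box.find("ymin").text),
                      int(box.find("xmax").text),   # lower-right corner
                      int(box.find("ymax").text)))
    return (width, height), boxes
```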

The improved YOLOv3 model
To make picking robots work more efficiently, object detection methods for fruit picking should have high recognition accuracy as well as fast real-time detection speed. Experiments on detecting lemons on trees show that the YOLOv3 model uses DarkNet53 as the backbone network for feature extraction, which has so many convolution layers that the computation is enlarged and some important feature information is lost, affecting the accuracy and speed of detection. To overcome these shortcomings, the L-YOLO model is proposed, drawing on the ideas of SENet [24] and ResNeXt [25]. On the basis of the YOLOv3 model, SE_ResGNet34 is designed as a new backbone network with less computation and a higher ability of feature extraction. Furthermore, the SE_ResNet module is proposed to enhance the features in channels and further improve the detection precision.

SE_ResGNet34 network
The Squeeze-and-Excitation network (SENet) is a convolutional neural network structure for visual tasks, proposed by Hu et al. [24]. It consists of convolutions and Squeeze-and-Excitation (SE) blocks, which automatically learn to obtain informative features and suppress useless ones, improving model performance. The SE block used in this study is shown in Figure 3. The feature maps, called U, are first passed through global pooling, the squeeze operation, which squeezes global spatial information into a channel descriptor by shrinking U through its spatial dimensions H × W. The aggregation is followed by two fully-connected (FC) layers around a non-linearity, the excitation operation, which enhances the useful information of features on channels. The first FC layer with a ReLU [26] function reduces the channel dimension to C/16; the last FC layer with a sigmoid activation normalizes the feature values and restores the channel dimension to C. The scale operation denotes channel-wise multiplication between the resulting feature maps, called Sc, and U to recalibrate the features. In this way, the SE block learns to enhance informative features and reduce irrelevant ones selectively by using global information [24]. Xie et al. [25] proposed the ResNeXt module to replace the residual module in the ResNet [27] network, which improved classification accuracy without significantly increasing the network parameters. The ResNeXt module mainly uses grouped convolutions with the same number of groups in the residual module, as shown in Figure 4, where a layer is denoted as (input channels, filter size, output channels). The grouped convolutional layers essentially divide the low-dimensional input feature maps into different groups along the input channels. The grouped convolutional layer in Figure 4(c) implements 32 groups of convolutions, whose input channels are 8D and output channels are 16D; the outputs of the groups are concatenated as the output of the layer.
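For concreteness, the following minimal sketch implements the SE block exactly as described (squeeze by global average pooling, excitation with reduction ratio 16, channel-wise scaling). It is written in PyTorch purely for illustration; the authors' experiments use PaddlePaddle.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation block with the reduction ratio of 16 used above."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # first FC: reduce to C/16
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),  # last FC: restore to C
            nn.Sigmoid(),                                # normalize channel weights
        )

    def forward(self, u):                    # u: feature maps U of shape (N, C, H, W)
        n, c, _, _ = u.shape
        s = u.mean(dim=(2, 3))               # squeeze: global pooling over H x W
        s = self.fc(s).view(n, c, 1, 1)      # excitation: per-channel weights Sc
        return u * s                         # scale: channel-wise recalibration
```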
The designed SE_ResGNet34 network, based on an improved ResNet34 [27], is mainly composed of 3 × 3 convolution layers, a max-pooling layer, SE_Res blocks and SE_ResX blocks. The SE_Res and SE_ResX blocks embed the SE block in the residual block and the ResNeXt block, respectively, unlike DarkNet53, which uses only convolution layers and residual blocks. Therefore, fewer convolution layers are used in the backbone while informative features are enhanced in the network. The specification of the SE_ResGNet34 structure used in this study is presented in Figure 5 (left).
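An SE_ResX-style block can then be sketched by embedding the SE block in a ResNeXt residual unit with grouped convolution, continuing the PyTorch sketch above (same imports, reusing SEBlock). The actual layer widths of SE_ResGNet34 are specified in Figure 5 and are not reproduced here, so the widths below are illustrative assumptions (channels must be divisible by the number of groups):

```python
class SEResXBlock(nn.Module):
    """SE_ResX-style block: ResNeXt unit with a grouped 3 x 3 convolution
    (32 groups) followed by an SE block and a shortcut connection."""
    def __init__(self, channels, groups=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, groups=groups, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        self.se = SEBlock(channels)

    def forward(self, x):
        return x + self.se(self.body(x))     # residual shortcut around conv + SE
```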

The improved YOLOv3 detection block

Twelve convolution layers, six 1 × 1 and six 3 × 3, were used in the original YOLOv3 detection block [8]. For single-class object detection, this number of convolution layers is redundant; it increases the computational complexity and slows down detection. Therefore, these convolution layers are replaced by three SE_ResNet modules, each of which consists of a 1 × 1 convolution layer, a 3 × 3 convolution layer, an SE block and a shortcut connection. The improved YOLOv3 detection block not only decreases the number of parameters but also further enhances the channel features of lemons in images, improving both detection precision and speed.
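A sketch of this SE_ResNet module is given below, again continuing the PyTorch sketches above and reusing SEBlock. The text does not specify the channel widths, so the 1 × 1 layer halving the channels and the 3 × 3 layer restoring them, the usual YOLOv3 bottleneck pattern, is an assumption:

```python
class SEResNetModule(nn.Module):
    """SE_ResNet module of the improved detection block: a 1 x 1 convolution,
    a 3 x 3 convolution, an SE block and a shortcut connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels // 2, 1, bias=False),   # assumed halving
            nn.BatchNorm2d(channels // 2),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(channels // 2, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.1, inplace=True),
        )
        self.se = SEBlock(channels)

    def forward(self, x):
        return x + self.se(self.conv(x))     # shortcut connection
```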

L-YOLO network
The proposed L-YOLO network is shown in Figure 5. During the training stage, the L-YOLO model establishes the loss function [28] by combining the information of the predicted bounding boxes output from the network with that of the ground-truth boxes, and minimizes the loss function through training. In the test stage, non-maximum suppression (NMS) [29] is used to select the predicted bounding boxes, discarding boxes whose category scores are below the score threshold and suppressing boxes whose overlap with a higher-scoring box exceeds the overlap threshold. Finally, the network outputs the predicted bounding boxes containing the object category and location, accomplishing the task of recognizing and locating lemons on trees.
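The NMS selection step reduces to a short greedy routine. The sketch below uses plain NumPy and the thresholds adopted in the test stage of this study (category score threshold 0.05, IoU threshold 0.5):

```python
import numpy as np

def nms(boxes, scores, score_thresh=0.05, iou_thresh=0.5):
    """Greedy NMS over boxes given as (x1, y1, x2, y2) rows with one score each."""
    keep_mask = scores >= score_thresh        # drop low-scoring boxes first
    boxes, scores = boxes[keep_mask], scores[keep_mask]
    order = scores.argsort()[::-1]            # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU between the top box and all remaining boxes.
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_rest = ((boxes[order[1:], 2] - boxes[order[1:], 0]) *
                     (boxes[order[1:], 3] - boxes[order[1:], 1]))
        iou = inter / (area_i + area_rest - inter)
        order = order[1:][iou <= iou_thresh]  # suppress heavily overlapping boxes
    return boxes[keep], scores[keep]
```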

Experimental setup
In this study, the experiments are conducted with the deep learning framework PaddlePaddle 1.7.0 in a Python 3.7.3 environment. The computer used in this study has an Intel i3-8100 64-bit 3.6 GHz quad-core CPU, 8 GB RAM and an NVIDIA Tesla V100 GPU, and runs the Windows 10 Professional operating system.
In the experiments, 3850 lemon images in the natural environment are used, of which 3200 are randomly selected as the training set (containing 4885 lemon fruits) and the remaining 650 images form the test set (containing 1152 lemon fruits). In the training stage, the momentum and weight decay are set to 0.9 and 0.0005, respectively; the total number of training iterations is set to 13,000 steps with an initial learning rate of 0.001, which is then divided by 10 after 8000 steps and again after 10,000 steps; the batch size is set to eight. The model randomly selects a value from the multi-scale set {320, 352, …, 704} as the input image size in each training iteration, which enhances the detection performance of the model at different input resolutions. The average precision on the test set is highest when the input image size is adjusted to 704 × 704 pixels. The detection models are trained from pre-trained ImageNet [30] classification models and fine-tuned on the lemon/grape/Pascal VOC datasets. In the test stage, the input image size is set to 704 × 704 pixels, the IoU threshold is set to 0.5, and the category score threshold is set to 0.05.
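The learning-rate schedule and multi-scale input sampling described above reduce to a few lines. The step boundaries, rates and scale set below are taken directly from the text:

```python
import random

def learning_rate(step):
    """Piecewise-constant schedule: 1e-3, divided by 10 after 8000 and 10,000 steps."""
    if step < 8000:
        return 1e-3
    if step < 10000:
        return 1e-4
    return 1e-5

def random_input_size():
    """Pick a training input size from {320, 352, ..., 704} (multiples of 32)."""
    return random.choice(range(320, 705, 32))
```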

Evaluation metrics
In this study, frames per second (FPS) and average precision (AP) are used as the performance metrics for evaluating the object detection models. AP is computed from the precision (P) and recall (R), which are defined as follows:

$$P = \frac{TP}{PB}, \qquad R = \frac{TP}{GT}$$

where TP, PB and GT denote the true positives (correct detections), the number of predicted bounding boxes and the number of ground-truth boxes, respectively. Average precision (AP) is the area under the precision-recall (P-R) curve, calculated as follows [31]:

$$AP = \sum_{i=1}^{N-1}\left(R_{i+1}-R_{i}\right)\,\max_{\tilde r \ge R_{i+1}} P(\tilde r)$$

where $N$ is the number of distinct recall values, and the precision at recall $R_{i+1}$ is interpolated as the maximum precision over all recalls not smaller than $R_{i+1}$.
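Given the precision-recall pairs accumulated over the test set (sorted by increasing recall), the AP defined above can be computed by all-point interpolation; a NumPy sketch:

```python
import numpy as np

def average_precision(recall, precision):
    """Area under the P-R curve: precision at each recall level is replaced by
    the maximum precision at any equal or higher recall, then the rectangles
    between consecutive distinct recall values are summed."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    for i in range(len(p) - 2, -1, -1):       # interpolation: make p non-increasing
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]        # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
```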

Training process and analysis
The flowchart of the training process of the L-YOLO model is shown in Figure 6.
The loss function is established by combining the ground-truth values with the predicted values. The total loss is the sum of the three loss functions constructed from the output feature maps P0, P1 and P2, and training proceeds by minimizing this loss. The loss function used in the L-YOLO model is defined as follows:

$$
\begin{aligned}
Loss ={}& \sum_{i=0}^{s^{2}-1}\sum_{j=0}^{B-1}\mathbb{1}_{ij}^{obj}\Big[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2+(w_i-\hat{w}_i)^2+(h_i-\hat{h}_i)^2\Big]\\
&-\sum_{i=0}^{s^{2}-1}\sum_{j=0}^{B-1}\Big[P_{oi}\log \hat{P}_{oi}+(1-P_{oi})\log (1-\hat{P}_{oi})\Big]\\
&-\sum_{i=0}^{s^{2}-1}\sum_{j=0}^{B-1}\mathbb{1}_{ij}^{obj}\Big[P_{ci}\log \hat{P}_{ci}+(1-P_{ci})\log (1-\hat{P}_{ci})\Big]
\end{aligned}
$$

where $s^2$ is the number of grids in the input image, $s \in \{m/32, m/16, m/8\}$ for an input size $m$, and $B$ is the number of bounding-box predictors per grid; $\mathbb{1}_{ij}^{obj} = 1$ if the object appears in the $j$th bounding-box predictor of grid $i$, and $\mathbb{1}_{ij}^{obj} = 0$ otherwise. $(x_i, y_i, w_i, h_i, P_{oi}, P_{ci})$ are the centre coordinates, width, height, object probability and class probability of the ground-truth box, and $(\hat{x}_i, \hat{y}_i, \hat{w}_i, \hat{h}_i, \hat{P}_{oi}, \hat{P}_{ci})$ are the corresponding values of the predicted bounding box.

FIGURE 7 The curve of the training loss function

Figure 7 shows the curve of the total loss of the L-YOLO model during training. The loss gradually decreases as the training iterations increase: it drops sharply before 2000 iterations, begins to flatten after 8000 iterations, becomes stable beyond 10,000 iterations, and finally settles at about 2.6, indicating that training converges and that the L-YOLO network is properly designed.

Image classification
In this study, the proposed SE_ResGNet34 network and the DarkNet53 network [8] are used for an image classification experiment on a tiny-ImageNet200 dataset (200 categories of images randomly selected from the ImageNet dataset [30]) with the PaddlePaddle framework. The test results are shown in Table 1. When the input images are adjusted to 320 × 320 pixels, the top-5 accuracy, the average time to classify an image and the model size of SE_ResGNet34 are 85.43%, 3.29 ms and 83 MB, respectively, which are clearly superior to those of DarkNet53. Therefore, the SE_ResGNet34 network is used as the new backbone network of YOLOv3 to extract richer lemon features and to improve the detection accuracy and speed of the model at the same time.

Lemon detection on different modifications
On the lemon test set, the effects of the SE_ResGNet34 architecture and of the SE_ResNet modules added to the detection block are studied. The results are shown in Table 2. Using SE_ResGNet34 as the backbone network of the YOLOv3 model improves the average precision (AP) by nearly 1% and increases the detection speed from 78 FPS to 111 FPS, compared with the DarkNet53 network, demonstrating that the new backbone has a more powerful ability to extract lemon features. In addition, replacing some convolution layers in the detection block with SE_ResNet modules improves the AP from 91.59% to 94.65% while the model still maintains high real-time speed, verifying the effectiveness of the improvement. When the input image size is set to 608 × 608 pixels in the test, the proposed L-YOLO model outperforms the original YOLOv3 model by 4% in AP and by 42 FPS in detection speed. The performance of L-YOLO is further improved, and remains clearly better than YOLOv3 in accuracy and speed, when the image size is adjusted to 704 × 704 pixels. Therefore, the proposed L-YOLO not only improves the detection accuracy but also detects faster in real time, which helps to accomplish high-efficiency detection of lemon fruits.
Some detection results of the proposed L-YOLO model are shown in Figure 8. Except for several tiny lemons taken from a far distance, as shown in Figure 8(a), a blurred small lemon in Figure 8(e) and other severely occluded lemons, the proposed method can accurately identify and locate lemons in images taken under different distances, varying illumination, general occlusion and moderate overlapping conditions.

Performance of different detection algorithms
To highlight the detection performance of L-YOLO, the proposed method is compared with the state-of-the-art detection methods SSD [6], RetinaNet [7], EfficientDet-D0 [10], YOLOv4 [9], Faster R-CNN [12], Cascade R-CNN [14] and FCOS [18]. The AP and the average detection speed of the different methods are shown in Table 3. Results marked with "-" were obtained with input image sizes in the range of about 800 to 1333 pixels, consistent with the official code. The table shows that the proposed L-YOLO has the best detection performance among all the methods. Its AP reaches 96.28% with an adjusted image size of 704 × 704 pixels, higher than that of the other models. In terms of detection speed, it achieves 106 FPS, more than five times faster than RetinaNet, Faster R-CNN, Cascade R-CNN and FCOS, and also higher than the others. The results indicate the superiority of the proposed method. Figure 9 illustrates the detection results of the state-of-the-art models and of L-YOLO on a randomly selected image from the lemon dataset. Figure 9(a,d,e) shows that SSD, YOLOv4 and Faster R-CNN miss the lemons that are severely occluded by leaves. Figure 9(c,e) shows that EfficientDet-D0 and Faster R-CNN each produce a redundant detection box. Cascade R-CNN cannot detect the lemons under severe occlusion and overlapping; moreover, it locates two lemons as one, as shown in Figure 9(f). Therefore, SSD, YOLOv4, EfficientDet-D0, Faster R-CNN and Cascade R-CNN exhibit missed or false detections. RetinaNet, FCOS and L-YOLO correctly recognize and locate the same, largest number of lemons, as shown in Figure 9(b,g,h), but the mean detection confidence score of L-YOLO is 0.76, which is 0.05 and 0.26 higher than those of RetinaNet and FCOS, respectively.

Object detection on grapes

To further validate the feasibility and effectiveness of the proposed L-YOLO model, different detection models are trained and tested under the same experimental conditions on the publicly available grape dataset WGISD (300 field grape images with 4432 objects) [32]. The results are shown in Table 4. The WGISD dataset contains five kinds of grapes. Compared with the lemon dataset built in this study, each bunch of grapes differs greatly in shape, size, colour and structure. Most of the grapes in the images, which have a resolution of 2048 × 1365 pixels (larger than the lemon images), are smaller, which makes the accuracy and speed of detection lower than those of lemon detection. However, the AP and detection speed of the proposed L-YOLO model reach 87.90% and 66 FPS, respectively, which are superior to those of the other models, indicating that the L-YOLO model has wider applicability.

Failure detection cases on lemons or grapes

Figure 10 demonstrates failure cases of the proposed L-YOLO model on some lemon or grape images. The leaf in the top right corner of Figure 10(a) is similar in colour to the green lemons, and the leaf in the top left corner of Figure 10(b) is nearly the same as deep green lemons in shape and colour, so both are detected as lemons. Figure 10(c) shows that L-YOLO is weak at detecting small lemons under severe occlusion. Some branches and leaves are detected as grapes, as shown in Figure 10(d,e). Because of the irregular shape and close proximity of the grapes, some redundant detection boxes appear in Figure 10(f), which leads to inaccurate grape localization.

Test results on Pascal VOC

We evaluate the models on the Pascal VOC dataset, which consists of 16,551 training images (VOC2007+2012) and 4952 test images (VOC2007) over 20 object categories. The performance comparison of the different detection models on the test set is shown in Figures 11 and 12. Figure 11 demonstrates that the proposed L-YOLO model has a clear advantage in parameters and is the fastest in detection speed compared with the other state-of-the-art models. Its parameter count is 26.09 M, second only to EfficientDet-D0, less than half of those of RetinaNet, YOLOv3, YOLOv4, Cascade R-CNN and FCOS, and also lower than those of SSD and Faster R-CNN. The detection speed of L-YOLO is 98 FPS, higher than all the other models: more than four times faster than RetinaNet, Faster R-CNN, Cascade R-CNN and FCOS, and twice as fast as EfficientDet-D0 and YOLOv4. Figure 12 shows the accuracy of the different detection models on the Pascal VOC2007 test set. The results indicate that the accuracy of the proposed L-YOLO model is better than that of most of the state-of-the-art models.
Multi-object datasets contain more object features, so their detection requires a model with stronger feature extraction ability. Compared with the proposed L-YOLO, YOLOv4, which applies various state-of-the-art bag-of-freebies and bag-of-specials methods to YOLOv3, has deeper convolution layers and a more complex structure to enhance multi-class feature extraction; thus it achieves higher detection accuracy on the Pascal VOC2007 test. However, L-YOLO clearly outperforms YOLOv4 in parameters and detection speed, as shown in Figure 11. Therefore, the proposed L-YOLO model offers a better trade-off among parameters, detection speed and accuracy, indicating that it is also suitable for multi-class object detection, covering categories such as animals, public transport and furniture.

FIGURE 12 The accuracy of different detection models on the Pascal VOC2007 test: (a) the AP of each model on each category; (b) the mAP of each model

CONCLUSIONS
In this study, the L-YOLO detector is proposed for lemon detection in the natural environment based on the YOLOv3 model. The SE block and the ResNeXt block are used to modify the ResNet34 network into a new backbone, termed SE_ResGNet34, which replaces DarkNet53 for feature extraction. It enhances the propagation of informative features and reduces the size of the model, improving detection accuracy and accelerating object localization. In the YOLOv3 detection block, some convolution layers are removed and SE_ResNet modules are added to obtain more lemon features, increasing the accuracy of lemon recognition while maintaining high real-time performance. The proposed L-YOLO model achieves an average precision of 96.28% and a detection speed of 106 FPS on the lemon test set, which is more effective than the state-of-the-art detection methods SSD, RetinaNet, EfficientDet-D0, YOLOv3, YOLOv4, Faster R-CNN, Cascade R-CNN and FCOS. Furthermore, several comparative experiments with different object detection methods are designed and analysed on the grape dataset and the Pascal VOC dataset. The grape test results show that the L-YOLO method has better accuracy and faster speed for grape detection in the field than the other state-of-the-art methods, and the Pascal VOC2007 test results show that L-YOLO is superior to most detection models.
The results of the present study indicate that the proposed L-YOLO method has wide applicability and strong generalization ability. It can detect lemons in the natural environment more efficiently, providing a new method for realizing high-efficiency robotic fruit picking.
In future work, the study will focus on improving the performance of tiny-object detection, especially for objects with severe overlap and occlusion.