Jensen–Shannon Divergence You Only Look Once: A Real‐Time Robotic Grasp Detection Network

In this article, the arbitrary-oriented object detection problem with application in robotic grasping is addressed. A novel Jensen–Shannon divergence (JSD)–You Only Look Once (YOLO) model is proposed, which enables real-time grasp detection with high performance. The one-stage object detection network YOLOv5 is modified with a decoupled head, which solves the angle classification problem and the rectangle parameter regression problem separately, such that the YOLOv5 network becomes applicable to robotic grasping and the detection accuracy is significantly improved. A circular smooth label angle classification method is proposed to tackle the boundary discontinuity problem in angle regression, and the periodicity of the angle prediction is guaranteed. A novel Jensen–Shannon intersection of union is designed to calculate the intersection over union of oriented rectangles, which better measures the discrepancies between the prediction and the ground truth and avoids the singularity problem when two rectangles do not overlap. Extensive evaluation on the Cornell and visual manipulation relationship (VMRD) datasets demonstrates the effectiveness of the JSD–YOLO model in general robotic grasp operations, with 99.7% and 95.7% image-wise split accuracy, respectively.


Introduction
With the development of artificial intelligence, robotic arms have been widely used in industrial manufacturing, resource exploration, rescue, medical services, and aerospace. Autonomous grasping with robotic arms is one of the fundamental operations in these applications. [1] The robotic arm needs to estimate its grasp position accurately, which directly affects the actual grasp operation. Traditional grasp detection methods need to analyze the kinematics and mechanics of the target, searching for grasping locations that satisfy criteria such as force closure [2] and form closure. [3] Since prior knowledge of target features is required, with relatively high computational cost, traditional techniques are not suitable for widespread use in unstructured scenes.
In recent years, as deep learning technology has evolved, the convolutional neural network (CNN) has been widely used for robotic grasp detection. A CNN learns an end-to-end mapping from images to grasp policies, which can simultaneously locate the object and estimate the grasp pose. CNNs have strong anti-interference ability and good generalization performance and, hence, are broadly applicable to grasp detection in unstructured scenes.
Currently, most CNN-based grasp detection approaches use a rectangle with an angle to represent the grasp configuration, where grasp angle prediction is always a challenging task. The grasp angle must satisfy properties such as periodicity and continuity. Angle prediction methods can be broadly divided into two categories: classification and regression. In general, the angle values obtained via regression are unbounded, and the loss increases dramatically on the boundaries. To tackle this issue, small-range regression of angles based on oriented anchors was performed, [4-7] although the introduction of oriented anchors brings extra parameters. Another method is to encode the angle as the components sin 2θ and cos 2θ of a unit vector, [8-11] and then solve for the angle value θ. The continuity of angle prediction is satisfied, and the values of sin 2θ and cos 2θ will not exceed the defined bounds. However, the network is unable to update the angle through the loss value, since the sum of squares of the predicted sin 2θ and cos 2θ is not necessarily equal to one. Several approaches, including the anchor-free grasp detector (AFGD), [12] the efficient grasp detection network (EGNet), [13,14] and the adaptive feature fusion and grasp-aware network (AFFGA-Net), [15] were proposed to limit the range of angle classification predictions. The grasp angle was discretized into 18 equal-length intervals, and each interval was represented using its average value, which avoids the boundary problem in angle prediction. However, current classification loss functions, such as binary cross-entropy loss (BCELoss) [16] and focal loss, [17] can neither quantify the angular gap between the prediction and the ground truth nor achieve the periodicity of the angle prediction.
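As a brief aside on the unit-vector encoding discussed above, the following minimal Python sketch (the function names are illustrative, not from the cited works) shows why the (sin 2θ, cos 2θ) representation is boundary-continuous: angles near 0° and 180° map to nearby points on the unit circle. It also shows the decoding step; the caveat raised in the text is that a network's raw outputs need not lie on the unit circle, so the decoded angle cannot be supervised directly through these components.

```python
import math

def encode_angle(theta: float) -> tuple[float, float]:
    # Map theta in [0, pi) to (sin 2*theta, cos 2*theta); doubling the
    # angle makes the encoding invariant to the pi-periodicity of
    # antipodal grasps, so 0 and pi encode to the same point.
    return math.sin(2.0 * theta), math.cos(2.0 * theta)

def decode_angle(s: float, c: float) -> float:
    # Recover theta via atan2 and wrap back into [0, pi).
    return (math.atan2(s, c) / 2.0) % math.pi

theta = math.radians(170.0)
s, c = encode_angle(theta)
print(math.degrees(decode_angle(s, c)))  # ~170.0, despite the nearby 0/180 boundary
```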
Another challenge in grasp detection is the design of the loss function. A good loss function characterizes the discrepancies between the prediction and the ground truth and hence enhances the training performance and prediction accuracy of the network. Moreover, a well-designed loss function enables the network to converge to the ground truth rapidly. Yang et al. [18] pointed out that a loss function needs to be consistent with varying angle distances, aspect ratios, and center offsets. However, the majority of regression loss functions used in current grasp detection methods are the smooth L1 loss [4,9,14,19] and the L2 norm loss, [20] which can only measure distance. It is shown in ref. [18] that an intersection over union (IoU)-induced loss function may measure the discrepancies between the prediction and the ground truth more accurately and can improve the consistency between the loss value and the detection results. For methods that classify the grasp angles and regress zero-oriented grasp rectangles separately, for example, AFGD, [12] the distance-IoU (DIoU) [21] loss function was computed for rectangular regression. An IoU-induced loss function calculated using zero-oriented rectangles can effectively characterize the position and aspect ratio differences between two rectangles but cannot accurately characterize their orientation differences. Recently, IoU for oriented rectangles has been studied. For example, SkewIoU [22] calculates the IoU of two oriented rectangles, but when the two rectangles do not intersect, the IoU equals zero and the network cannot perform parameter optimization and gradient updates.
To solve the aforementioned problems, this article proposes a Jensen-Shannon divergence (JSD)-You Only Look Once (YOLO) model for robotic grasp detection. The one-stage object detection model YOLOv5 is introduced with an angle prediction branch, and a decoupled head is designed to improve the performance of the model. A circular smooth label (CSL) angle classification method is proposed to predict the grasp angle, which satisfies the boundary continuity requirement of angle prediction. A novel Jensen-Shannon Intersection of Union (JSIoU) is proposed based on the 2D Gaussian distribution representation and JSD to calculate the IoU of oriented rectangles. The JSD-YOLO model is evaluated on two public datasets to validate the superiority of the proposed method.
The main contributions of this article are highlighted as follows. 1) A modified YOLOv5 with an angle prediction branch is introduced to the robotic grasp detection task. The head of YOLOv5 is decoupled, where the angle classification and rectangle parameter regression are solved separately. The proposed model achieves high detection performance and real-time running speed with few network parameters. 2) A CSL angle classification method is proposed, which not only solves the boundary discontinuity problem of angle prediction but also realizes the periodic prediction of angles. The CSL method is used to further improve the accuracy of grasp rectangles. 3) A novel JSIoU is designed for oriented rectangles. The JSIoU is a Gaussian-based method for calculating the IoU of oriented rectangles in grasp detection, which improves the consistency between the prediction and the ground truth. The proposed JSIoU can also be used in various fields of oriented object detection.
The article is organized as follows. Section 2 discusses related grasp detection methods. Section 3 describes the grasp representation used in this article. Section 4 details the implementation of the JSD-YOLO method. Section 5 shows the experiment results. Finally, Section 6 concludes this article.

Related Work
The in-plane grasp detection methods can be broadly divided into two-stage and one-stage methods. Two-stage methods typically cascade two CNNs. [25,26] The first CNN generates grasp candidates, and the second CNN generates grasp predictions by evaluating the feature vectors of the candidate regions. In general, two-stage methods can achieve better performance by predicting potential grasp regions, but they require sliding windows to sample the images, and the long inference time limits the real-world application of these methods.
In contrast to the selective search algorithm, which is extensively used in two-stage grasp detection, one-stage grasp detection algorithms learn the grasp strategy directly from the input, resulting in a large reduction in computational cost. For example, Redmon et al. [27] proposed a one-stage regression approach based on AlexNet [28] that detected the grasp rectangles directly on the entire image. Morrison et al. [8] proposed a generative grasp CNN (GG-CNN), and several grasp detection models, such as GR-CNN, [9] the Squeeze-and-Excitation Residual Network (SE-ResUNet), [10] and GGS-CNN, [29] have been developed based on GG-CNN. However, GG-CNN is highly dependent on depth information and is susceptible to interference from the environment, which places higher demands on image collection.
To improve the performance of one-stage detection methods, an anchor box strategy was proposed in the field of object detection, which predicts the grasp rectangles using a series of anchor boxes with different sizes and aspect ratios. To fully exploit the role of anchor boxes as a priori grasp information, Zhang et al. [4] and Park et al. [20] proposed grasp detection methods based on oriented anchors. However, oriented anchors may introduce extra anchor-related hyperparameters that need to be adjusted manually to optimize the performance for different tasks. Also, to save training and inference time, only one size of oriented anchor was used; due to the large interval between anchor angles, the prediction error increases.
Recently, many one-stage anchor-free grasp detection methods have been proposed. [12,19] When assigning samples, Wu et al. [12] used a 2D Gaussian distribution to assign different weights to different regions of the positive samples, i.e., the weight of the central region is the highest and the weight of the grasp edge is the lowest, which may effectively reduce the problems caused by ambiguous samples. Cheng et al. [19] encoded the ground truth with a 2D Gaussian distribution, followed by feature extraction and prediction to generate a grasp probability heatmap. In essence, the anchor-free method is a dense prediction method. It is prone to many false positives due to the large solution space, and hence the prediction is not as accurate as that of anchor-based methods.
YOLOv5 [30] is a novel anchor-based one-stage object detection model with few parameters, high detection accuracy, and fast computation speed, and hence has been widely implemented in many fields. Song et al. [31] applied YOLOv5 to robotic grasping and pruned YOLOv5 for a lighter and more effective network model. Their method mainly addressed the localization and recognition of a specific wooden block target grasped by an industrial robot. In addition, YOLOv5 is also widely used in agricultural crop detection, [32,33] satellite component detection, [34] remote sensing image detection, [35,36] and so on. YOLOv5 has been validated to achieve good detection results and real-time detection performance in these applications. However, the original YOLOv5 generates horizontal or vertical detection boxes that cannot be used for robotic grasp detection. Therefore, in this article, an angle prediction branch is added to the YOLOv5 network, and the head is decoupled so that the angle prediction and rectangle parameter prediction are independent of each other.

Grasp Representation
The goal of the network implementation is to detect a reasonable grasp configuration for a given red/green/blue (RGB) image, and hence, the grasp configuration representation is crucial. Under the assumption of a top-down view with a top-down grasp, the in-plane grasp representation can be directly converted into a 6 degree-of-freedom grasp for execution. There are two main representations for in-plane grasp detection: the grasp quality representation [8] and the 5D grasp rectangle representation. [27] In this article, the 5D grasp representation is used, which is simpler and more convenient for parallel grippers. As shown in Figure 1, the in-plane grasp configuration is fully represented by a 5D vector (x, y, h, w, θ), where the center (x, y) of the oriented rectangle indicates the location of the gripper, h represents the opening distance of the gripper, w represents the width of the gripper, and θ represents the angle between the horizontal axis and the moving direction of the gripper. Since the grasp angle is symmetric, the range of θ is [0, π).
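To make the 5D representation concrete, here is a minimal Python sketch of the grasp rectangle and its conversion to corner points, the kind of four-point annotation used by the Cornell dataset. The class and the axis convention (h spanning the gripper's moving direction) are illustrative assumptions, not code from the original work.

```python
from dataclasses import dataclass
import math

@dataclass
class Grasp:
    # 5D in-plane grasp: center (x, y), gripper opening distance h,
    # gripper width w, and rotation theta in [0, pi), as in the text.
    x: float
    y: float
    h: float
    w: float
    theta: float

    def corners(self):
        # Four corners of the oriented grasp rectangle, assuming h spans
        # the gripper's moving direction (the rotated x-axis); useful for
        # drawing or converting back to four-point annotations.
        c, s = math.cos(self.theta), math.sin(self.theta)
        dx, dy = self.h / 2.0, self.w / 2.0
        return [(self.x + c * ex - s * ey, self.y + s * ex + c * ey)
                for ex, ey in ((-dx, -dy), (dx, -dy), (dx, dy), (-dx, dy))]
```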

JSD-YOLO Network Architecture
In this section, the proposed JSD-YOLO grasp detection model is described in detail. First, the YOLOv5 network is introduced with a modified decoupled head, where rectangular regression and angle classification are performed separately on the grasp configuration. Then, the CSL method is used to predict the grasp angle. Finally, the JSIoU is designed based on JSD to realize the IoU calculation of oriented rectangles.

YOLOv5 Grasp Network Architecture with Decoupled Head
In this article, we propose an arbitrary-oriented grasp detection method based on the YOLOv5 object detection framework. The network includes input, feature extraction (backbone), feature fusion (neck), and prediction (head). An angle classification branch is added in the prediction (head), and then the head is decoupled to predict the grasp rectangle and angle separately.
The network architecture of the JSD-YOLO is shown in Figure 2.
The YOLOv5 network uses RGB images as inputs and calculates the ideal anchors for the dataset via K-means clustering. Feature extraction employs the Focus module and cross-stage partial networks (CSPNet). Focus is used to reduce the dimension of the features and retain effective information to increase the receptive field. CSPNet contains a residual structure for local cross-layer fusion, which combines the feature information from different layers to obtain a richer feature map. Feature fusion combines three distinct types of feature maps obtained from the feature extraction network. To integrate low-level information into high-level features, YOLOv5 adopts a hybrid structure of the feature pyramid network (FPN) and the path aggregation network (PAN). High-level features of different scales are aggregated top-down by the FPN, and then low-level features are aggregated bottom-up by the PAN. The resulting feature maps, which represent the intensity of the features extracted by the CNN kernels, are used as inputs to the head of the network. This neck structure enriches and fuses different levels of image features, improving the feature extraction ability and the learning ability of the network for various objects and scales.
In the head of YOLOv5, the angle classification and rectangular regression tasks share almost the same parameters; however, the two tasks are subject to different feature requirements. The classification task is sensitive to specific regions, while the regression task is sensitive to the entire object and its boundaries. Therefore, the original head with coupled classification and regression tasks diminishes the convergence speed and performance of the model. In this article, a decoupled head is introduced into the prediction of YOLOv5, such that the classification and regression tasks are solved in parallel.
The decoupled head of JSD-YOLO is similar to the one in YOLOX. [37] In the decoupled head, features are fed into two side-by-side branches, each of which has two convolution-BatchNorm-SiLU (CBS) modules. Each CBS module comprises a 3 × 3 convolution, a BatchNorm layer, and a SiLU activation function in series. One branch performs the angle classification task to predict the grasp angle, and the other branch performs the regression task to predict the grasp rectangle parameters and the confidence separately. Features from the classification branch are only used for the angle classification task, outputting scores for all angle classes, while features from the regression branch are used to output the offsets and confidence scores of the rectangles; hence, the accuracy of the model can be significantly improved.
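The following PyTorch sketch illustrates the branch layout described above: two side-by-side stacks of CBS modules, with angle class scores coming from one branch and box offsets plus confidence from the other. Channel counts, anchor counts, and the number of angle classes are assumptions for illustration; this is not the authors' implementation.

```python
import torch
import torch.nn as nn

def cbs(c_in, c_out, k=3):
    # Conv-BatchNorm-SiLU block, mirroring the CBS modules described above.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(inplace=True),
    )

class DecoupledHead(nn.Module):
    # A minimal sketch: one angle-classification branch and one regression
    # branch, each with two CBS modules; channel sizes are assumptions.
    def __init__(self, c_in=256, n_angles=18, n_anchors=3):
        super().__init__()
        self.cls_branch = nn.Sequential(cbs(c_in, c_in), cbs(c_in, c_in))
        self.reg_branch = nn.Sequential(cbs(c_in, c_in), cbs(c_in, c_in))
        self.angle = nn.Conv2d(c_in, n_anchors * n_angles, 1)  # angle class scores
        self.box = nn.Conv2d(c_in, n_anchors * 4, 1)           # (tx, ty, tw, th)
        self.obj = nn.Conv2d(c_in, n_anchors * 1, 1)           # confidence

    def forward(self, x):
        f_cls, f_reg = self.cls_branch(x), self.reg_branch(x)
        return self.angle(f_cls), self.box(f_reg), self.obj(f_reg)
```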

Circular Smooth Label Angle Classification
Predictions of grasp angles via regression are generally unbounded, and the loss values increase sharply on the boundaries, which affects the grasp angle prediction accuracy. The classification method has been used to limit the range of angle prediction; e.g., in ref. [14], the grasp angle prediction was solved as a classification problem by discretizing the grasp angle into 18 equal-length intervals, with each interval represented using its average value. The boundary discontinuity problem in angle regression can thus be avoided; however, current classification losses, such as BCELoss and focal loss, are incapable of quantifying the angular differences between the prediction and the ground truth. For example, as shown in Figure 3, if the ground truth is 5°, the predicted angle of 15° is closer to the ground truth angle, so its loss value should be smaller than that corresponding to 95°. However, the BCELoss values are identical when the predicted angles are 95° and 15°, indicating that this simple angle classification approach is ineffective for angle prediction.
Therefore, in this article, the CSL angle classification is introduced into the head of the network. CSL [38] employs a periodic circular label encoding approach, which avoids boundary discontinuities and complicated parameter issues, and hence has been used in oriented object detection.
As shown in Figure 4, the range of angle prediction is 180°, and CSL classifies angles into n categories. As a result, the assigned angle label interval is (180/n)°, where n is a design parameter.
To solve the boundary discontinuity problem, a window function is employed when calculating the angle classification loss, so that the loss value is small when the predicted angle is close to the ground truth within a certain range. Therefore, the CSL is defined as

\[ \mathrm{CSL}(x) = \begin{cases} G(x), & \theta - r \le x \le \theta + r \\ 0, & \text{otherwise} \end{cases} \tag{2} \]

where x denotes an angle value from the n categories, r is the radius of the window function, which is chosen to be 4 in this article, θ is the current ground truth angle, and G(·) is the window function. Since the Gaussian function simultaneously satisfies the properties of periodicity, symmetry, and monotonicity on either side of the peak, and its detection performance has been demonstrated in remote sensing image detection, the Gaussian function is chosen as the window function in this article.
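A short Python sketch of how such a circularly wrapped label vector might be built. The Gaussian window with standard deviation tied to the radius r is an assumption for illustration (the exact window parameterization is not specified beyond Equation (2)); the circular distance is what gives the label its periodicity.

```python
import numpy as np

def csl_label(theta_deg: float, n: int = 180, r: float = 4.0) -> np.ndarray:
    # Circular smooth label: a Gaussian window of radius r centered on the
    # ground truth angle bin, wrapped circularly so that bins near 0 and
    # 180 degrees are treated as neighbors.
    bin_width = 180.0 / n
    center = theta_deg / bin_width
    bins = np.arange(n, dtype=np.float64)
    # Circular distance between each bin and the ground truth bin.
    d = np.abs(bins - center)
    d = np.minimum(d, n - d)
    label = np.exp(-(d ** 2) / (2.0 * r ** 2))  # Gaussian window (sigma=r assumed)
    label[d > r] = 0.0  # zero outside the window of radius r, per Equation (2)
    return label

print(csl_label(5.0, n=180)[:8])    # smooth labels near 5 degrees
print(csl_label(178.0, n=180)[:3])  # nonzero near 0 due to circular wrap
```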
From Figure 3, it can be seen that the BCELoss values of 15° and 95° are identical when using the simple classification method. When using CSL, however, the BCELoss corresponding to 15° is significantly smaller than the one corresponding to 95°. As can be seen, CSL can effectively calculate the distance between the predicted angle and the ground truth angle.

JSIoU
JSD-YOLO estimates the grasp rectangle parameters (x, y, w, h) from the decoupled head via bounding box regression methods. There are many well-established and effective IoUs used to calculate the bounding box regression loss function, such as IoU, [39] generalized IoU (GIoU), [40] complete IoU (CIoU), [21] and DIoU. [21] IoU-based loss functions can precisely describe the distance between two zero-oriented rectangles but cannot accurately represent the distance between two oriented rectangles. Therefore, designing an IoU loss function for two oriented rectangles can improve the consistency between the loss value and the detection performance. SkewIoU [22] calculates the IoU of two oriented rectangles by decomposing their intersection into multiple triangles. The SkewIoU is calculated as

\[ \mathrm{SkewIoU} = \frac{\mathrm{Area}(\mathrm{intersection})}{\mathrm{Area}(R_i) + \mathrm{Area}(R_j) - \mathrm{Area}(\mathrm{intersection})} \tag{3} \]

where Area(·) denotes the area of a region, R_i and R_j denote the two rectangles, and intersection denotes the intersection of R_i and R_j. From Equation (3), it can be seen that when the two rectangles do not intersect, Area(intersection) = 0, and hence SkewIoU = 0; therefore, the network is incapable of performing parameter optimization and gradient updates. To solve this problem, a novel JSIoU is proposed in this article, where a 2D Gaussian distribution is used to represent the oriented rectangles, and the IoU is designed based on JSD. The JSIoU is designed as follows.
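For reference, an oriented-rectangle IoU in the spirit of Equation (3) can be computed with a general polygon library; the sketch below uses shapely (an assumed dependency, not used by the paper) rather than SkewIoU's triangle decomposition. The printed example shows the vanishing-gradient issue: disjoint rectangles give exactly zero.

```python
import math
from shapely.geometry import Polygon  # assumed available; used only for this sketch

def rect_polygon(x, y, w, h, theta):
    # Four corners of an oriented rectangle centered at (x, y).
    c, s = math.cos(theta), math.sin(theta)
    pts = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return Polygon([(x + c * px - s * py, y + s * px + c * py) for px, py in pts])

def skew_iou(r1, r2):
    # Area(intersection) / Area(union) for two oriented rectangles,
    # as in Equation (3).
    p1, p2 = rect_polygon(*r1), rect_polygon(*r2)
    inter = p1.intersection(p2).area
    return inter / (p1.area + p2.area - inter)

# Disjoint rectangles: SkewIoU is exactly zero, so an IoU-based loss
# provides no gradient signal -- the problem JSIoU is designed to avoid.
print(skew_iou((0, 0, 10, 4, 0.0), (30, 0, 10, 4, math.pi / 4)))  # 0.0
```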
First, a 2D Gaussian distribution is used to represent the oriented rectangle. The 2D Gaussian probability density function (PDF) is calculated as

\[ f(X) = \frac{1}{2\pi |\Sigma|^{1/2}} \exp\!\left( -\frac{1}{2} (X - \mu)^{T} \Sigma^{-1} (X - \mu) \right) \]

where X = [x, y]^T ~ N(μ, Σ), μ ∈ R² denotes the mean vector, and Σ ∈ R^{2×2} denotes the positive semi-definite covariance matrix of the two variables. The real symmetric matrix Σ is orthogonally diagonalized and decomposed as

\[ \Sigma = Q \Lambda Q^{T} \]

where Q is a real orthogonal matrix (denoted as the rotation matrix) and Λ is a diagonal matrix with eigenvalues in descending order. Therefore, we have Σ^{-1} = QΛ^{-1}Q^{T}, and the PDF of the Gaussian distribution is transformed into

\[ f(X) = \frac{1}{2\pi |\Lambda|^{1/2}} \exp\!\left( -\frac{1}{2} (X - \mu)^{T} Q \Lambda^{-1} Q^{T} (X - \mu) \right) \]

The oriented rectangle is approximated using a 2D Gaussian distribution, and the transformation between the 5D coordinates and the Gaussian distribution representation is depicted in Figure 5. The mean vector μ corresponds to the center coordinate (x, y) of the oriented rectangle B, and Σ contains the rotation angle θ and size information (w, h) of the oriented rectangle. Q is the rotation matrix carrying the rotation angle information

\[ Q = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \]

and the diagonal entries of the matrix Λ represent the square of the long half-axis w/2 and the square of the short half-axis h/2 of the ellipse, respectively, i.e.,

\[ \Lambda = \mathrm{diag}\!\left( \left(\tfrac{w}{2}\right)^{2}, \left(\tfrac{h}{2}\right)^{2} \right) \]
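A small Python sketch of the rectangle-to-Gaussian conversion just described, composing Σ = QΛQᵀ from the rotation angle and the half-axis lengths; the function name is illustrative.

```python
import numpy as np

def rect_to_gaussian(x, y, w, h, theta):
    # Represent an oriented rectangle as N(mu, Sigma): the mean is the
    # center, and Sigma = Q diag((w/2)^2, (h/2)^2) Q^T encodes size and
    # rotation, following the equations above.
    mu = np.array([x, y], dtype=np.float64)
    q = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    lam = np.diag([(w / 2.0) ** 2, (h / 2.0) ** 2])
    sigma = q @ lam @ q.T
    return mu, sigma

mu, sigma = rect_to_gaussian(50, 40, 20, 8, np.pi / 6)
print(mu, "\n", sigma)
```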
Figure 6 depicts the Gaussian distribution approximation of a rectangle. The 3D representation of the 2D Gaussian distribution is shown in Figure 6a. The projection of the 2D Gaussian distribution onto the XOY plane is shown in Figure 6b, where the red rectangle represents the original oriented rectangle. Due to the continuity of the Gaussian distribution, its density is nonzero everywhere, and hence, the overlap measure between two rectangles is never exactly zero. Therefore, when the predicted rectangle is disjoint from the ground truth, the network is still able to update continuously, so that the prediction can converge to the ground truth.
Given the 2D Gaussian representation of the oriented rectangle, the divergence between the prediction and the ground truth, i.e., the distance between two Gaussian distributions, is then calculated using JSD. [41] JSD is a bounded symmetrization of the Kullback-Leibler (KL) divergence. [42] The JSD of two 2D Gaussian distributions N(μ₁, Σ₁) and N(μ₂, Σ₂) with a skew ε is

\[ D_{\mathrm{JS}} = (1 - \varepsilon)\, D_{\mathrm{KL}}\!\big(\mathcal{N}(\mu_1, \Sigma_1) \,\|\, \mathcal{N}(\mu_\varepsilon, \Sigma_\varepsilon)\big) + \varepsilon\, D_{\mathrm{KL}}\!\big(\mathcal{N}(\mu_2, \Sigma_2) \,\|\, \mathcal{N}(\mu_\varepsilon, \Sigma_\varepsilon)\big) \]

where

\[ \mu_\varepsilon = \Sigma_\varepsilon \big( (1 - \varepsilon)\, \Sigma_1^{-1} \mu_1 + \varepsilon\, \Sigma_2^{-1} \mu_2 \big) \]

and the matrix harmonic barycenter is

\[ \Sigma_\varepsilon = \big( (1 - \varepsilon)\, \Sigma_1^{-1} + \varepsilon\, \Sigma_2^{-1} \big)^{-1} \]

Define the oriented rectangle IoU based on JSD as

\[ \mathrm{JSIoU} = \frac{1}{\tau + f(D_{\mathrm{JS}})} \]

where f(·) is a nonlinear function for smoothing the divergence D_JS, chosen to be the sqrt(·) function in this article, and τ is a design parameter, chosen to be 1.0.
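Under the formulas above, a NumPy sketch of the JSIoU computation might look as follows; it inlines the rectangle-to-Gaussian conversion from the earlier sketch and uses the standard closed-form KL divergence between 2D Gaussians. Treat it as an illustration of the reconstruction above rather than the authors' exact implementation.

```python
import numpy as np

def rect_to_gaussian(x, y, w, h, theta):
    # As in the earlier sketch: Sigma = Q diag((w/2)^2, (h/2)^2) Q^T.
    q = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return (np.array([x, y], dtype=np.float64),
            q @ np.diag([(w / 2) ** 2, (h / 2) ** 2]) @ q.T)

def kl_gauss(mu1, s1, mu2, s2):
    # Closed-form KL divergence between two 2D Gaussians (dimension k = 2).
    inv2 = np.linalg.inv(s2)
    d = mu2 - mu1
    return 0.5 * (np.trace(inv2 @ s1) + d @ inv2 @ d - 2.0
                  + np.log(np.linalg.det(s2) / np.linalg.det(s1)))

def js_divergence(mu1, s1, mu2, s2, eps=0.5):
    # Skewed JSD using the Gaussian with matrix-harmonic-barycenter
    # covariance as the intermediate distribution (see the equations above).
    inv1, inv2 = np.linalg.inv(s1), np.linalg.inv(s2)
    s_eps = np.linalg.inv((1.0 - eps) * inv1 + eps * inv2)
    mu_eps = s_eps @ ((1.0 - eps) * inv1 @ mu1 + eps * inv2 @ mu2)
    return ((1.0 - eps) * kl_gauss(mu1, s1, mu_eps, s_eps)
            + eps * kl_gauss(mu2, s2, mu_eps, s_eps))

def jsiou(rect1, rect2, tau=1.0):
    # JSIoU = 1 / (tau + sqrt(D_JS)); strictly positive even for
    # non-overlapping rectangles, so gradients never vanish.
    mu1, s1 = rect_to_gaussian(*rect1)
    mu2, s2 = rect_to_gaussian(*rect2)
    return 1.0 / (tau + np.sqrt(js_divergence(mu1, s1, mu2, s2)))

print(jsiou((0, 0, 10, 4, 0.0), (30, 0, 10, 4, np.pi / 4)))  # nonzero, unlike SkewIoU
```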
To visually compare the differences between JSIoU and SkewIoU, a combination exhibiting the possible relative positions of the ground truth and predicted rectangles is shown in Figure 7a, where for the ith row, i = 1, …, 6, the ellipse in red is rotated by i·π/6, and for the jth column, j = 1, …, 6, the ellipse in green is rotated by j·π/6. For each combination in Figure 7a, the distance between the center points of the two rectangles is varied from −20 to 20 pixels in the y direction, and the corresponding JSIoU and SkewIoU are plotted in Figure 7b.
It can be seen that the JSIoU proposed in this article is symmetric and monotonic on either side: JSIoU decreases monotonically, yet never becomes exactly zero, as the distance between the center points of the two rectangles increases. Therefore, the proposed JSIoU outperforms SkewIoU in network training.
To improve the accuracy of the predicted height and width, a modification term V is added to the JSD, where V is a function of w and h, the width and height of the predicted grasp rectangle, respectively, and of w* and h*, the width and height of the ground truth grasp rectangle, respectively. Thus, the modified JSD is

\[ \tilde{D}_{\mathrm{JS}} = D_{\mathrm{JS}} + V \]

and the modified JSIoU is

\[ \mathrm{JSIoU} = \frac{1}{\tau + f(\tilde{D}_{\mathrm{JS}})} \tag{16} \]

To illustrate the impact of the modification term on JSIoU, a simulation of predicted rectangles and the ground truth rectangle with the corresponding JSIoU is shown in Figure 8. Figure 8a illustrates the relative positions and sizes of the predicted and ground truth rectangles. Without loss of generality, we assume that the heights of both rectangles are fixed and the widths are varied. In Figure 8b, the JSIoUs with and without the modification term are presented for different ratios of w/w*. It can be seen that when the width of the predicted rectangle changes, the JSIoU without the modification term is relatively smooth, while the JSIoU with the modification term decreases rapidly. Therefore, the network converges to the ground truth faster using the modified JSIoU.

Loss Function
The head of the JSD-YOLO network divides each feature map into a number of cells, and each cell outputs a vector {t_x, t_y, t_w, t_h, P_obj, P_θ}, where (t_x, t_y, t_w, t_h) represents the position information of the grasp rectangle and is obtained from the regression: t_x and t_y are used to calculate the offset between the center of the predicted rectangle and the corresponding anchor box, and t_w and t_h are used to calculate the width and height of the predicted rectangle. P_obj denotes the confidence, and P_θ denotes the probability for each of the n angle categories. The grasp angle prediction is solved as a classification task, and the category with the highest classification prediction probability is considered the best predicted angle. Therefore, the loss function consists of three parts, including the oriented rectangular loss ℒ_obox, the confidence loss ℒ_obj, and the angular classification loss ℒ_angle. The total loss function is given as follows

\[ \mathcal{L} = \lambda_1 \sum_{k=1}^{K} \sum_{i=1}^{S} \sum_{j=1}^{H} I_{kij}^{\mathrm{obj}}\, \mathcal{L}_{\mathrm{obox}}(t, t^{*}) + \lambda_2 \sum_{k=1}^{K} \varphi_k \sum_{i=1}^{S} \sum_{j=1}^{H} \mathcal{L}_{\mathrm{obj}} + \lambda_3 \sum_{k=1}^{K} \sum_{i=1}^{S} \sum_{j=1}^{H} I_{kij}^{\mathrm{obj}}\, \mathcal{L}_{\mathrm{angle}} \]

where t and t* represent the prediction and ground truth vectors, respectively; K, S, and H are the numbers of output feature maps, cells, and anchors on each cell, respectively; the hyperparameters λ₁, λ₂, and λ₃ are constants; I^obj_kij indicates whether the jth anchor box of the ith cell in the kth output feature map is a positive sample, with I^obj_kij = 1 if so and I^obj_kij = 0 otherwise; and φ_k is used to balance the weight of each scale's output feature map.
The JSD loss function is defined as

\[ \mathcal{L}_{\mathrm{JSD}} = 1 - \mathrm{JSIoU} \]

where JSIoU is defined in Equation (16).
To reduce the distance between the centroids of the two Gaussian distributions, a centroid loss is added to the oriented rectangular loss. Therefore, the oriented rectangular loss function is defined as

\[ \mathcal{L}_{\mathrm{obox}} = \mathcal{L}_{\mathrm{JSD}} + r_1 (t_x - t_x^{*})^{2} + r_2 (t_y - t_y^{*})^{2} \]

where t_x and t_y denote the center coordinates of the predicted rectangle, t*_x and t*_y denote the center coordinates of the ground truth, and r₁ and r₂ are weights.
The target confidence loss ℒ_obj and the angle classification loss ℒ_angle are calculated using the focal loss with an α-balanced variant as follows

\[ \mathcal{L}_{\mathrm{obj}} = \mathrm{FL}(P_{\mathrm{obj}}, P_{\mathrm{obj}}^{*}), \qquad \mathcal{L}_{\mathrm{angle}} = \mathrm{FL}(P_{\theta}, P_{\theta}^{*}) \]

where FL(·) is the focal loss with an α-balanced variant, P_obj and P*_obj denote the predicted and ground truth confidence scores, respectively, P_θ denotes the predicted probabilities of the n angle categories, and P*_θ denotes the CSL values of all angle categories of the ground truth in Equation (2).
The focal loss with an α-balanced variant for variables u and v is calculated as

\[ \mathrm{FL}(u, v) = -\alpha v (1 - u)^{\gamma} \log(u) - (1 - \alpha)(1 - v) u^{\gamma} \log(1 - u) \tag{22} \]

where α is a balance parameter, usually chosen as α = 0.25, and γ is a focusing parameter of the focal loss, usually chosen as γ = 1.5. The predicted offsets (t_x, t_y, t_w, t_h) for the four predicted parameterized coordinates (x, y, w, h) are defined as follows

\[ x = 2\sigma(t_x) - 0.5 + c_x, \quad y = 2\sigma(t_y) - 0.5 + c_y, \quad w = p_w \big(2\sigma(t_w)\big)^{2}, \quad h = p_h \big(2\sigma(t_h)\big)^{2} \]

where σ denotes the sigmoid function, c_x and c_y denote the coordinates of the upper-left corner of the grid cell containing the center point, and p_w and p_h denote the size of the corresponding anchor box.
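A compact PyTorch sketch of the two pieces just defined: the α-balanced focal loss of Equation (22) and the offset decoding. The decoding follows YOLOv5's standard parameterization, which is an assumption about the exact form intended here.

```python
import torch

def focal_loss(u, v, alpha=0.25, gamma=1.5, eps=1e-7):
    # Alpha-balanced focal loss FL(u, v) as in Equation (22): u is the
    # prediction in (0, 1), v is the (possibly soft) target, e.g. a CSL value.
    u = u.clamp(eps, 1.0 - eps)
    return (-alpha * v * (1 - u) ** gamma * torch.log(u)
            - (1 - alpha) * (1 - v) * u ** gamma * torch.log(1 - u))

def decode_box(t, anchor_wh, grid_xy):
    # YOLOv5-style decoding of the predicted offsets (a plausible reading
    # of the offset equations above; the exact scaling is an assumption).
    xy = 2.0 * torch.sigmoid(t[..., :2]) - 0.5 + grid_xy
    wh = (2.0 * torch.sigmoid(t[..., 2:4])) ** 2 * anchor_wh
    return torch.cat([xy, wh], dim=-1)
```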

Experiment and Evaluation
The proposed JSD-YOLO model is trained and evaluated on two open datasets and compared to state-of-the-art methods. In this section, we first introduce the public datasets, training details, and evaluation metrics for in-plane grasp detection, and then demonstrate the results of JSD-YOLO on the two datasets.

Datasets
There are a limited number of publicly available antipodal grasp datasets. The Cornell dataset [24] and the visual manipulation relationship dataset (VMRD) [43] are used for training and for verifying the effectiveness of the proposed method.
The Cornell dataset is the most widely used dataset in robotic grasp detection. It comprises 885 RGB-D images, with a resolution of 640 × 480 pixels, of 240 common objects, each labeled with several positive and negative grasp rectangles. The annotated ground truth consists of the coordinates of the four vertices of the grasp rectangles, with multiple grasp rectangles per object.
The VMRD dataset contains 31 categories of objects and 4683 images with grasps in total. In the VMRD, more than 100 k grasps are labeled, and the grasp rectangles are annotated in the same way as in the Cornell dataset. In each VMRD image, two to five objects are stacked on each other, which is used to verify the effectiveness of our model in stacked scenes.
There are two splits for grasp datasets. Image-wise (IW) split: the IW method splits the dataset randomly with the same probability for each image. The IW method is used to test the prediction ability of the model for the same object in different positions.
Object-wise (OW) split: the OW method splits the dataset randomly in terms of objects, i.e., it ensures that objects in the test set do not appear in the training set. The OW method is used to test the generalization ability of the model when encountering different objects.
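The two protocols differ only in the unit of shuffling, as the following Python sketch illustrates (the (image_id, object_id) sample structure is an assumption for illustration).

```python
import random

def split_dataset(samples, mode="IW", test_ratio=0.2, seed=0):
    # samples: list of (image_id, object_id) pairs -- field layout assumed.
    rng = random.Random(seed)
    if mode == "IW":
        # Image-wise: shuffle images regardless of object identity.
        pool = samples[:]
        rng.shuffle(pool)
        k = int(len(pool) * test_ratio)
        return pool[k:], pool[:k]
    # Object-wise: hold out whole objects so test objects are unseen.
    objects = sorted({obj for _, obj in samples})
    rng.shuffle(objects)
    held_out = set(objects[: int(len(objects) * test_ratio)])
    train = [s for s in samples if s[1] not in held_out]
    test = [s for s in samples if s[1] in held_out]
    return train, test
```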

Training Details
The JSD-YOLO is trained on a single NVIDIA GeForce RTX 3060 GPU with 12 GB memory. The implementation framework is PyTorch with CUDA 11.1. The batch size is set to 32, and the optimizer used is Adam. The initial learning rate is set to 0.0001 and the weight decay coefficient is 0.0005. The number of training epochs is 100.
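For concreteness, the reported optimizer settings translate directly into PyTorch as follows (the stand-in module is a placeholder, since the full model definition is not reproduced here).

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3)  # stand-in for the JSD-YOLO network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=5e-4)
EPOCHS, BATCH_SIZE = 100, 32  # as reported above
```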

Evaluation Metric
The evaluation of grasp detection results uses the rectangular metric, i.e., a predicted grasp rectangle is regarded as an available grasp configuration if both of the following conditions are satisfied.
1) The angle difference between the prediction and the ground truth is less than 30°, i.e.,

\[ |\theta - \theta^{*}| < 30^{\circ} \tag{25} \]

where θ and θ* denote the predicted and ground truth grasp angles, respectively. 2) The Jaccard index of the prediction and the ground truth is larger than 0.25, i.e.,

\[ J(G, G^{*}) = \frac{|G \cap G^{*}|}{|G \cup G^{*}|} > 0.25 \]

where G and G* denote the predicted and ground truth rectangles, respectively.
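Putting both conditions together, a minimal Python checker might look like this; the shapely-based Jaccard index mirrors the earlier oriented-IoU sketch, and the modulo-π angle comparison reflects the grasp symmetry assumption.

```python
import math
from shapely.geometry import Polygon  # assumed dependency, as in the earlier sketch

def rect_poly(x, y, w, h, theta):
    # Corners of an oriented rectangle centered at (x, y).
    c, s = math.cos(theta), math.sin(theta)
    pts = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return Polygon([(x + c * px - s * py, y + s * px + c * py) for px, py in pts])

def is_valid_grasp(pred, gt, angle_thresh_deg=30.0, jaccard_thresh=0.25):
    # pred, gt: (x, y, w, h, theta) tuples. A grasp counts as correct when
    # BOTH the angle test (Equation (25), taken modulo pi for grasp
    # symmetry) and the Jaccard test pass.
    d = abs(pred[4] - gt[4]) % math.pi
    if min(d, math.pi - d) >= math.radians(angle_thresh_deg):
        return False
    p, g = rect_poly(*pred), rect_poly(*gt)
    inter = p.intersection(g).area
    return inter / (p.area + g.area - inter) > jaccard_thresh
```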

Evaluation on Cornell Dataset
The four-point representation of the grasp rectangles in the dataset is converted to the 5D grasp rectangle representation. To fully train the model, we augment the dataset using random rotations, crops, Gaussian noise, and brightness changes, such that the augmented dataset has 8850 images with 51 k grasps. To avoid cropping the object itself or the labeled rectangles, we find the minimum bounding box of all labeled rectangles and take its center as the center of the 320 × 320 pixel crop. The augmented dataset is divided into a training set and a testing set with a ratio of 8:2, where 90% of the training set is used for network training and the remaining 10% is used for validation.
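A sketch of the annotation-preserving crop described above; the array shapes and the NumPy-based implementation are assumptions.

```python
import numpy as np

def crop_center(image, rects, size=320):
    # Center the crop on the minimum bounding box of all labeled grasp
    # rectangles so that no annotation is clipped, as described above.
    # image: HxWxC array; rects: corner points with shape (N, 4, 2).
    pts = np.asarray(rects).reshape(-1, 2)
    cx, cy = (pts.min(axis=0) + pts.max(axis=0)) / 2.0
    h, w = image.shape[:2]
    x0 = int(np.clip(cx - size / 2, 0, max(w - size, 0)))
    y0 = int(np.clip(cy - size / 2, 0, max(h - size, 0)))
    return image[y0:y0 + size, x0:x0 + size]
```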
The grasp detection results of JSD-YOLO and of the methods in the literature are shown in Table 1. The experiment results show that the proposed model outperforms the existing models, achieving 99.7% accuracy at the IW level and 98.0% accuracy at the OW level. Furthermore, the prediction speed of 14 ms per image suggests that JSD-YOLO is suitable for real-time closed-loop applications.
The detection results are shown in Figure 9. The first row is the confidence top 1 prediction. The second row contains predictions for multiple grasps. The third row shows the ground truth grasp rectangles. The confidences of the multi-grasp predictions in the second row all exceed 0.5. The confidence top 1 grasp detection results in the first row indicate that the results predicted by JSD-YOLO are reasonable. As can be seen from the multi-grasp prediction results in the second row, our model makes predictions based on the shape and characteristics of the object rather than merely fitting the labels: many grasps not marked in the ground truth still appear among our model's predictions. This indicates that our predicted grasp results can cover most representative grasp configurations effectively, which is also an advantage of the anchor-based detection method.
Although JSIoU without the modification term can accurately identify the position of the object, the predicted grasp edges are obviously larger than those of the ground truth, and the network would still count such grasp rectangles with large edges as reasonable predictions, which affects the final grasp detection results. From the fourth row of Figure 10, it can be seen that the modification term can adjust the height and width of the predicted grasp rectangles effectively, resulting in more reasonable grasp configurations.

Accuracy under Different Angle Thresholds
In Table 2, our model is tested under different angle thresholds in Equation (25). The results show that the model still achieves high accuracy when the angle threshold is set to 10° and 15° for the IW and OW splits, respectively. In particular, for the IW split, when the angle threshold decreases from 30° to 15°, the accuracy of the model is reduced by only 0.6%, indicating that the JSD-YOLO model detects the grasp angle accurately.

Detection Results of Unseen Objects
JSD-YOLO is tested on unseen objects in different realistic and complex scenes. The categories of the unseen objects never appear in the Cornell dataset. They are commonly seen in real life and have shapes similar to those in the Cornell dataset. The results are shown in Figure 11. It can be seen that in the case of occlusion, our model still has good performance and generates dense grasp rectangles, which verifies the generalization ability of the proposed model.

Ablation Studies
The ablation studies are performed and the results of the ablation experiments are shown in Table 3. The baseline is the original YOLOv5s with simple angle classification of 18 classes (Class18). First, Class18 is replaced with CSL18 and CSL180, respectively, to verify the superiority of the CSL angle prediction method.
Then, for YOLOv5s with CSL18, CIoU is replaced with JSIoU to verify the effectiveness of the proposed JSD loss function.
Finally, the head of YOLOv5 is decoupled. Experimental results show that a decoupled head can improve the accuracy effectively.

Evaluation on VMRD Dataset
Considering the scenario where more than one object needs to be grasped in an actual robotic grasp application, the model is also evaluated on the VMRD dataset. During training, we augment the VMRD dataset with random brightness, random noise, and random flips. The augmented dataset has 14 k images, and 90% of the training set is used for training while the remaining 10% is used for validation.
The validation results with comparisons to previous works are summarized in Table 4. It can be seen that the proposed JSD-YOLO outperforms the state-of-the-art models, [7,13,44] which demonstrates its effectiveness in handling stacked scenarios. The detection results of JSD-YOLO on the VMRD dataset are shown in Figure 12.
From Tables 1 and 4, it can be seen that the results of our model are consistent across the different datasets, and JSD-YOLO achieves the highest accuracy on both.

Conclusion
In this article, a one-stage real-time robotic grasp detection model, JSD-YOLO, is proposed. The model takes advantage of YOLOv5 and introduces an angle prediction branch to predict oriented grasp rectangles. The decoupled head is proposed to solve the angle classification and rectangle parameter regression tasks separately, and hence, the accuracy of the model is significantly improved. Further, to tackle the boundary discontinuity problem of angle prediction and realize the periodic prediction of angles, a CSL angle classification method is proposed to predict the grasp angle, which can improve the detection performance. Moreover, a novel JSIoU based on JSD for oriented rectangles is presented, which can not only characterize the IoU of two oriented rectangles effectively but also prevent the network from falling into singularities when the two rectangles do not intersect. The proposed JSD-YOLO is evaluated on two public grasp datasets, the Cornell and VMRD datasets, and compared with state-of-the-art methods. The proposed model is also validated on novel unseen objects and adversarial objects in cluttered scenarios. The results demonstrate that JSD-YOLO can perform accurate grasps in different complex scenarios. Moreover, the low inference time of our model makes it suitable for closed-loop robotic grasp applications. In general, JSD-YOLO can realize planar grasp detection of objects and meet real-time requirements while ensuring accuracy. Future work includes conducting real-world experiments to validate the practical applicability of the proposed method. Moreover, grasp detection with unseen objects and in occluded situations will be further investigated.

Figure 2. The architecture of JSD-YOLO. The network includes the backbone, neck, feature maps, and decoupled head. In the decoupled head, features are fed into two side-by-side branches. One branch performs the angle classification and the other performs the regression, with focal loss and JSIoU loss, respectively.

Figure 4. CSL in angle prediction for a) 18 classes and b) 180 classes.

Figure 7. Simulation of the relative positions of two 2D Gaussian distributions. a) Combination of the relative positions of two ellipses. b) Comparisons of JSIoU and SkewIoU with varying distances between the center points of two ellipses.

Figure 8. Simulation of predicted and ground truth rectangles. a) The relative position and size of predicted and ground truth rectangles. b) Corresponding JSIoU.

Figure 9. Grasp detection results on the Cornell dataset. The first row is the confidence top 1 prediction. The second row is multi-grasp predictions. The third row is ground truth grasp rectangles.

Figure 10. Comparison of grasp detection results with and without the modification term.

Figure 11. Detection results on unseen objects.

Table 1. Comparison results on the Cornell dataset.

Table 3. Results of ablation experiments.

Table 4. Comparison results on the VMRD dataset.

Authors | Method | Accuracy [%]
Zhou et al. [44] | FCGN | 54.5
Zhang et al. [7] | ROI-GD | 68.2
Yu et al. [13] | EGNet | 87.1
JSD-YOLO (ours) | YOLOv5, CSL18, JSIoU | 95.7