A monocular vision positioning and tracking system based on a deep neural network



INTRODUCTION
Spatial positioning is an active research topic in computer vision and has been widely applied in autonomous driving, mobile robots, aerospace and other fields. Among vision-based positioning methods, monocular visual positioning needs only one camera to complete the positioning task. Compared with multi-camera visual positioning, it has the advantages of simple structure, high stability and good real-time performance [1]. Research on localization based on monocular vision therefore has practical significance and application value.
Most traditional monocular vision positioning methods are based on geometric principles: by establishing an appropriate model and combining it with the imaging process of the camera, the position and attitude of the object can be solved. The most classic approach is the Perspective-n-Point (PnP) solution [2]. It uses the projection relationship between n feature points in the image and their corresponding spatial points to determine the position and pose of the target relative to the camera, and it is widely used in computer vision. However, the PnP solution suffers from low accuracy and poor stability. Zhi et al. [3] proposed a real-time image registration and target localization algorithm. It uses an improved ORB (Oriented FAST and Rotated BRIEF) to extract features, uses the RANSAC algorithm to achieve precise matching and to estimate the parameters of the transformation model, and uses CUDA to improve the real-time performance of localization, but the algorithm is highly dependent on the matching accuracy. Monocular visual localization is also widely used in SLAM (Simultaneous Localization and Mapping). For example, MonoSLAM, LSD-SLAM, ORB-SLAM and their improved variants have achieved good accuracy in monocular visual positioning systems. Liu et al. [4] proposed an algorithm combining object detection with ORB-SLAM2. It applies YOLOv4 target detection to the global mapping process of ORB-SLAM2 to determine the position and pose of the object in the world coordinate system, which provides more effective information for the positioning process. However, the model established by this method has a relatively large number of internal parameters. In general, traditional methods require accurate mathematical models that are relatively complex to solve, and when the scene changes, the parameters need to be adjusted to meet the new demands.
In addition, the adaptability and generalization ability of traditional methods are weak in environments with many targets and complex backgrounds [5].
With the rapid development of deep learning, many localization algorithms based on neural networks have gradually emerged [6]. Taira et al. [7] proposed the InLoc method. It uses multi-scale dense CNN features to achieve dense matching and performs pose verification through virtual view synthesis, surpassing the state of the art in indoor positioning accuracy at that time. However, this method still lacks sufficient information for pose selection.
Kendall et al. [8] proposed PoseNet, an end-to-end network that directly outputs the pose from a single image and keeps the positioning error of indoor scenes within 0.5 m. However, this algorithm predicts the pose frame by frame, and its generalization ability and robustness are poor. To make full use of the information in image sequences, Wang et al. [9] proposed the ESP-VO network model, which uses deep recurrent convolutional neural networks (RCNNs) trained in an end-to-end manner. It directly calculates the pose from a series of raw RGB images, thus improving the positioning effect. However, the accuracy and robustness of this model need to be further improved. There are also localization methods that use neural networks for coordinate regression. Brachmann et al. [10] proposed DSAC++, a visual localization method based on neural network regression of scene coordinates. It achieves high localization accuracy by learning the scene coordinate regression and using local linearization to effectively optimize the poses, but the image information that it can utilize is limited. Sarlin et al. [11] proposed a scene-agnostic network called PixLoc. It learns data priors from pixels to pose by end-to-end training and optimizes the pose while separating model parameters from scene geometry. It can be used for accurate positioning in multiple scenes, but the algorithm needs good initialization and is susceptible to changes in viewpoint. Compared with traditional positioning methods, positioning algorithms based on neural networks can obtain more accurate positioning and better generalization through end-to-end learning without establishing complex geometric models, which makes them a research hotspot in visual positioning [12].
The above localization algorithms based on neural networks have achieved good localization results, but some problems remain: it is difficult to achieve high positioning accuracy for dynamic objects, and the process of obtaining or labeling training data is cumbersome.
To meet the requirements of easy installation, high accuracy and fast speed of the sentry system in the ICRA Artificial Intelligence Challenge, this paper proposes a monocular visual positioning and tracking system based on a neural network. It uses the information obtained from target detection to build an MLP regression model that outputs the coordinates of the robot in the field, and it achieves good accuracy and speed when localizing robots in motion. The effectiveness and accuracy of the system are verified in robot competitions.

APPLICATION SCENARIOS
The ICRA Artificial Intelligence Challenge adopts the form of automatic shooting confrontation between the red and blue robots. Our robots need to sense the battlefield environment, find and hit the enemy robots' armor to win. In order to accurately obtain the enemy's position, the sentry system is required to provide the robot with the coordinate information of the opponent in the field.
The competition field is shown in Figure 1, in which the robot, the armor area and the visual tags are all marked. The sentry system consists of cameras mounted on brackets at the two ends of a diagonal of the field, so that the whole field can be captured.

FIGURE 1. ICRA robot competition field
To ensure both real-time performance and accuracy of localization, the lightweight GhostNet-YOLOv5 algorithm is used in the target detection part, and the traditional mathematical solution method and the MLP neural network algorithm are each used in the spatial localization part for two groups of comparison experiments. The overall architecture of the positioning system is shown in Figure 2.

YOLO v5 OBJECT DETECTION ALGORITHM
Object detection is a prerequisite for accurate spatial localization. Commonly used deep learning object detection algorithms include two-stage algorithms (such as Faster R-CNN) and one-stage algorithms (such as the YOLO series). Compared with two-stage algorithms, the YOLO series directly regresses the category, confidence and location of the target through the network, which gives fast detection speed and strong advantages in model deployment. Since YOLOv1 was proposed in 2016, the YOLO series has been continuously updated and improved [13][14][15], and it has been widely used in intelligent transportation, defect detection, face recognition and other scenarios. Therefore, this paper adopts the latest YOLOv5 for object detection, providing algorithm support for the subsequent spatial localization.
The YOLOv5 algorithm feeds the entire image into the network, divides the image into a grid, and directly regresses the position and category of the bounding boxes at the output layer. The network structure of YOLOv5 is divided into four parts: input layer, Backbone, Neck and prediction layer. The input layer uses Mosaic data augmentation, adaptive anchor calculation and adaptive image scaling to improve the speed and robustness of target detection. The Backbone uses a series of convolutional structures, including ordinary convolution, Focus, BottleneckCSP and SPP, to extract image features and enlarge the receptive field of the backbone features. The Neck uses the FPN + PAN structure to aggregate features, which strengthens the feature information of the network and passes it to the prediction layer [16]. The prediction layer uses the CIoU loss function and weighted non-maximum suppression (NMS) to generate the output results and improve the detection accuracy.
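The CIoU loss mentioned for the prediction layer can be illustrated with a short, dependency-free sketch. It follows the commonly published CIoU definition (IoU minus a center-distance term and an aspect-ratio term); the function name `ciou_loss` and the corner format `(x1, y1, x2, y2)` are assumptions made for illustration and are not taken from the YOLOv5 code base. Boxes are assumed to have positive width and height.

```python
import math

def ciou_loss(box_a, box_b):
    """Return 1 - CIoU for two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Plain intersection over union.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    wa, ha = ax2 - ax1, ay2 - ay1
    wb, hb = bx2 - bx1, by2 - by1
    iou = inter / (wa * ha + wb * hb - inter)

    # Squared center distance, normalized by the squared diagonal
    # of the smallest box enclosing both inputs.
    rho2 = ((ax1 + ax2 - bx1 - bx2) ** 2 + (ay1 + ay2 - by1 - by2) ** 2) / 4.0
    enc_w = max(ax2, bx2) - min(ax1, bx1)
    enc_h = max(ay2, by2) - min(ay1, by1)
    c2 = enc_w ** 2 + enc_h ** 2

    # Aspect-ratio consistency term and its trade-off weight.
    v = (4.0 / math.pi ** 2) * (math.atan(wb / hb) - math.atan(wa / ha)) ** 2
    alpha = v / (1.0 - iou + v) if v > 0 else 0.0

    return 1.0 - (iou - rho2 / c2 - alpha * v)
```

Unlike plain IoU, the loss keeps a useful gradient for boxes that do not overlap, because the center-distance term still decreases as the predicted box moves toward the ground truth.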
In this paper, the structure of the YOLOv5s network is optimized: two lightweight networks, MobileNetV2 [17] and GhostNet [18], are each used to replace the original backbone for training and testing. MobileNetV2 keeps the depthwise separable convolutions of MobileNetV1 and introduces the inverted residual structure and the linear bottleneck module. The inverted residual structure first uses a 1×1 convolution to increase the dimension, then uses a 3×3 depthwise convolution to extract features, and finally uses a 1×1 pointwise convolution to reduce the dimension, which strengthens the feature extraction ability of the network. The linear bottleneck layer uses a linear convolution to replace the original combination of convolution and ReLU (Rectified Linear Unit) function [19], which helps to retain information. The MobileNetV2 network greatly reduces the computational cost while improving the accuracy. The core of GhostNet is the Ghost module, which first uses ordinary convolution to obtain a small number of feature maps, then applies cheap operations to them to generate additional, redundant feature maps, and finally concatenates the outputs of these two steps to obtain feature maps of the same size as those of an ordinary convolution. Compared with ordinary convolution, this design reduces the number of parameters and the computational cost. The SE attention module [20] is also added to GhostNet for better feature extraction. Figure 3 shows the structure of the YOLOv5s-MobileNetV2 network. The original backbone network is replaced by 17 inverted residual modules from MobileNetV2. The original Focus structure is replaced by the standard convolutional module, which consists of a convolution layer, a BN layer and a SiLU activation function. Figure 4 shows the structure of the YOLOv5s-GhostNet network. The backbone network consists of the Focus structure, the GhostBottleNeck structure, the SE attention module and the SPP structure.
In this backbone, the combination of Ghost module and SE attention module (named GhostBottleNeck) replaces the original C3 module.
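To make the saving of the Ghost module concrete, the following sketch counts convolution weights for an ordinary convolution and for a Ghost module with the same number of output feature maps. The settings used here (1×1 primary convolution, 3×3 depthwise cheap operation, ratio s = 2, and the channel counts 64 and 128) are illustrative assumptions, not values reported in this paper; bias and BN parameters are ignored, and the output channel count is assumed to be divisible by s.

```python
def conv_params(c_in, c_out, k):
    """Weights of an ordinary k x k convolution (bias and BN ignored)."""
    return c_in * c_out * k * k

def ghost_params(c_in, c_out, k=1, d=3, s=2):
    """Weights of a Ghost module producing c_out feature maps."""
    intrinsic = c_out // s                      # maps from the ordinary convolution
    primary = conv_params(c_in, intrinsic, k)   # step 1: ordinary convolution
    cheap = intrinsic * (s - 1) * d * d         # step 2: depthwise d x d cheap operations
    return primary + cheap                      # step 3 (concatenation) adds no weights

ordinary = conv_params(64, 128, 1)
ghost = ghost_params(64, 128)
print(ordinary, ghost)
```

With these assumed settings the Ghost module needs 4672 weights instead of 8192, because half of the output maps are generated by inexpensive depthwise filters rather than by a full convolution over all input channels.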

OBJECT SPATIAL POSITION MODEL BASED ON MATHEMATICAL SOLUTION
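Based on the information of the visual tags in the competition field, this paper designs a mathematical solution method. In the competition, the center point of the bounding box obtained by target detection is used as the known pixel coordinate, and the four corner points of the visual tags in the field are used as the feature points. Then, according to the transformation relationships between the pixel coordinate system, the camera coordinate system, the tag coordinate system and the world coordinate system, the absolute pose relationship between the target and the camera can be calculated. The coordinate system conversion equations in this process are as follows:

1) Transformation of the pixel coordinate system to the camera coordinate system: let $K$ be the camera internal matrix, $(u, v, 1)$ the pixel coordinate, $f_x$ the focal length along the x-axis described in pixels, $f_y$ the focal length along the y-axis described in pixels, $(c_x, c_y)$ the principal point coordinates, and $(X_c, Y_c, Z_c)$ the camera coordinate. This gives formula (1):

$$Z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix}, \qquad K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \tag{1}$$

From formula (1), we can obtain formula (2):

$$\begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} = Z_c K^{-1} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} \tag{2}$$

2) Transformation of the camera coordinate system to the tag coordinate system: let $(X_t, Y_t, Z_t)$ be the tag coordinates (the direction of the tag coordinate system is shown in Figure 1), and let $R_{tc}$, $T_{tc}$ be the rotation and translation from the tag coordinate system to the camera coordinate system. This gives formula (3):

$$\begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} = R_{tc} \begin{bmatrix} X_t \\ Y_t \\ Z_t \end{bmatrix} + T_{tc} \tag{3}$$

Next, formula (4) can be obtained:

$$\begin{bmatrix} X_t \\ Y_t \\ Z_t \end{bmatrix} = R_{tc}^{-1} \left( \begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} - T_{tc} \right) \tag{4}$$

3) Transformation of the tag coordinate system to the world coordinate system: the rotation and translation that transform the visual tags in the field to the site coordinate system are represented by $R_{wt}$ and $T_{wt}$ respectively, and $(X_w, Y_w, Z_w)$ is the world coordinate (the direction of the world coordinate system is shown in Figure 1), so we have formula (5):

$$\begin{bmatrix} X_w \\ Y_w \\ Z_w \end{bmatrix} = R_{wt} \begin{bmatrix} X_t \\ Y_t \\ Z_t \end{bmatrix} + T_{wt} \tag{5}$$

By substituting formula (4) and formula (2) into formula (5), we can obtain formula (6):

$$\begin{bmatrix} X_w \\ Y_w \\ Z_w \end{bmatrix} = R_{wt} R_{tc}^{-1} \left( Z_c K^{-1} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} - T_{tc} \right) + T_{wt} \tag{6}$$

4) Calculation of the unknown parameters in formula (6): since all robots in the competition field can be regarded as moving on the same horizontal plane, $Z_w$ can be treated as 0, so the third row of formula (6) gives formula (7), which can be solved for the unknown $Z_c$:

$$0 = \left[ R_{wt} R_{tc}^{-1} \left( Z_c K^{-1} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} - T_{tc} \right) + T_{wt} \right]_3 \tag{7}$$

Through the above mathematical solution method, the value of the world coordinates can be obtained according to the transformation of each coordinate system.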
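The chain of transformations described in this section can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `pixel_to_world`, the argument names (`r_tc`, `t_tc` for the tag-to-camera rotation and translation, `r_wt`, `t_wt` for the tag-to-world rotation and translation), and the zero-skew intrinsic model with focal lengths `fx`, `fy` and principal point `cx`, `cy` are all chosen here for illustration. The unknown depth along the viewing ray is fixed by requiring the world height to be zero, as the ground-plane assumption in the text suggests.

```python
def mat_vec(m, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]

def mat_mul(a, b):
    """Multiply two 3x3 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def transpose(m):
    return [[m[j][i] for j in range(3)] for i in range(3)]

def pixel_to_world(u, v, fx, fy, cx, cy, r_tc, t_tc, r_wt, t_wt):
    """Back-project pixel (u, v) onto the ground plane (world height 0)."""
    # Viewing ray in the camera frame: inverse intrinsics applied to (u, v, 1).
    ray = [(u - cx) / fx, (v - cy) / fy, 1.0]
    # Camera -> tag -> world rotation; a rotation's inverse is its transpose.
    m = mat_mul(r_wt, transpose(r_tc))
    a = mat_vec(m, ray)                       # part that scales with the depth
    mt = mat_vec(m, t_tc)
    b = [t_wt[i] - mt[i] for i in range(3)]   # constant offset
    # World point = depth * a + b; the height component must vanish.
    depth = -b[2] / a[2]
    return [depth * a[i] + b[i] for i in range(3)]
```

For example, with a camera 2 m above the tag origin looking straight down (`r_tc = [[1, 0, 0], [0, -1, 0], [0, 0, -1]]`, `t_tc = [0, 0, 2]`), an identity tag-to-world transform, and `fx = fy = 100`, `cx = cy = 50`, the pixel (100, 75) maps to the ground point (1.0, -0.5, 0.0). In practice the tag-to-camera pose would come from a PnP solution on the tag corners, so errors in the corner selection propagate directly into the result.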

SPATIAL LOCATION MODEL BASED ON MLP NEURAL NETWORK
Section 4 discussed how to use the mathematical solution method to obtain the coordinates of the robot in the field. In this section, we design an MLP neural network to locate the robot. The MLP neural network uses the information acquired after object detection to train a regression model, and then obtains the site coordinates of the robot in an end-to-end way.
The MLP neural network is mainly composed of an input layer, hidden layers and an output layer. There can be multiple neurons in each hidden layer, and adjacent layers are fully connected. Its learning process consists of forward propagation of the signal and backward propagation of the error [21]. Weights, biases and activation functions are the three important elements of the network. The weights represent the strength and importance of the connections between neurons, the biases control the activation state of the neurons, and the activation function provides a nonlinear mapping between input and output. The MLP neural network has strong nonlinear fitting ability and adaptive learning ability, and it can deal with complex multi-input, multi-output nonlinear systems.
In neural networks, the number of hidden layers and the number of nodes in each of them determine the capacity of the network, and the choice of activation function also affects its nonlinear fitting ability, so setting the parameters of the hidden layers is very important. According to the competition requirements, an MLP neural network with four inputs and two outputs was built in this experiment to train the regression model. The four parameters in the input layer are the upper-left coordinates, width and height of the bounding box obtained after object detection, and the two parameters in the output layer are the coordinates of the robot in the field. For the MLP network to generalize well, the number of hidden layers and the number of nodes in them must be adjusted during the design. Theoretically, increasing the number of hidden layers brings more representational power, but it may also increase the computing time and cause overfitting. Therefore, the network uses two hidden layers. The number of neurons in each hidden layer is usually adjusted by trial and error, that is, the number of nodes in each layer is gradually increased until the accuracy of the model no longer improves. In general, a layer with more nodes followed by a layer with fewer nodes gives better performance.
Following these principles, two hidden layers with 128 and 64 neurons respectively were designed. To ensure learning efficiency, the ReLU activation function is used in both hidden layers, and Dropout is used to randomly drop a proportion of the neurons to prevent overfitting. The output layer uses the Sigmoid activation function. The specific MLP neural network structure is shown in Figure 5.
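The forward pass of the network described above (4 inputs, hidden layers of 128 and 64 ReLU units, 2 Sigmoid outputs) can be sketched without any external dependencies. This is only an illustration of the layer structure: the weights below are random and untrained, the helper names (`init_layer`, `dense`, `mlp_forward`) and the initialization range are assumptions, and Dropout is omitted because it is only active during training. The authors' model is built in PyTorch, which this sketch does not reproduce.

```python
import math
import random

def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer."""
    bound = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-bound, bound) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def dense(x, layer):
    """Fully connected layer: one weighted sum plus bias per output neuron."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def relu(x):
    return [max(0.0, v) for v in x]

def sigmoid(x):
    return [1.0 / (1.0 + math.exp(-v)) for v in x]

def mlp_forward(bbox, layers):
    """bbox = normalized (x, y, w, h); returns normalized field coordinates."""
    h1 = relu(dense(bbox, layers[0]))      # hidden layer 1: 128 neurons
    h2 = relu(dense(h1, layers[1]))        # hidden layer 2: 64 neurons
    return sigmoid(dense(h2, layers[2]))   # output layer: 2 values in (0, 1)

rng = random.Random(0)
layers = [init_layer(4, 128, rng), init_layer(128, 64, rng), init_layer(64, 2, rng)]
print(mlp_forward([0.25, 0.40, 0.10, 0.15], layers))
```

The Sigmoid output keeps both predictions inside (0, 1), which matches labels that have been scaled to the unit interval by the field dimensions.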

EXPERIMENTAL ENVIRONMENT
The specific experimental environment is shown in Table 1. The labels in PASCAL VOC format in the DJI COCO open source dataset are converted to YOLO format, and the locally collected dataset is labeled directly in YOLO format with the LabelImg annotation software. After merging, all data are divided into a training set and a test set in a ratio of 8:2.
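The label conversion and the 8:2 split can be sketched as follows. PASCAL VOC stores a box as pixel corners (xmin, ymin, xmax, ymax), while the YOLO format stores the box center, width and height as fractions of the image size. The function names, the fixed random seed and the shuffling step are assumptions made for this illustration; the paper does not state how the split was drawn.

```python
import random

def voc_to_yolo(xmin, ymin, xmax, ymax, img_w, img_h):
    """Convert VOC pixel corners to YOLO (x_center, y_center, width, height)."""
    x_center = (xmin + xmax) / 2.0 / img_w
    y_center = (ymin + ymax) / 2.0 / img_h
    width = (xmax - xmin) / img_w
    height = (ymax - ymin) / img_h
    return x_center, y_center, width, height

def split_dataset(items, train_ratio=0.8, seed=0):
    """Shuffle a copy of the items and split it into train and test parts."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_ratio)
    return items[:cut], items[cut:]

print(voc_to_yolo(100, 200, 300, 400, 640, 480))
train, test = split_dataset(range(100))
print(len(train), len(test))
```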

EXPERIMENT AND ANALYSIS
In the experiment, the total number of training epochs is 1000, the batch size is 32, the image size is 640×640, and the learning rate is 0.001. YOLOv5s is selected as the training model, and its backbone network is replaced by MobileNetV2 and GhostNet respectively, giving three sets of experiments for comparison. To make full use of resources and speed up inference, the best model from each of the three training runs was deployed in a C++ project using the LibTorch framework, and the frame rate is used to reflect the inference speed. The experimental results are shown in Table 2. From Table 2, we can see that all three object detection models obtain good detection results. After MobileNetV2 and GhostNet are introduced as the backbone, the model maintains high accuracy while the number of network parameters and the complexity of the network are both reduced, which makes the model lightweight. Since the GhostNet-YOLOv5s model has the highest detection accuracy, the smallest parameter size and the fastest inference among the three, it is chosen as the base model for target detection. Figure 6 shows the effect of real-time target detection.

DATA AND PRE-PROCESSING
The bounding box information obtained after target detection is saved as the input data for the MLP regression model, and the coordinates obtained by the robot using the AMCL localization algorithm are used as labels. A total of 3000 sets of data are collected and divided into 2600 sets of training data and 400 sets of test data.
Good data preprocessing plays an important role in the training process and in the performance of the neural network. Since the input features and labels in this experiment have different sources and units of measure, both are normalized to the same interval [0,1]. For the input data, the horizontal coordinate and the width of the bounding box are divided by the horizontal image resolution, and the vertical coordinate and the height are divided by the vertical image resolution; for the label data, the two actual coordinates are divided by the width and height of the field respectively. This normalization reduces the amount of computation and accelerates the convergence of the model.
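The normalization described above, together with the inverse step that maps a prediction back to metres, can be sketched as follows. The image resolution and field size used here are placeholder values assumed for illustration only; the actual camera resolution and field dimensions are not stated in this section, and the function names are hypothetical.

```python
IMG_W, IMG_H = 640, 480        # assumed camera resolution in pixels (placeholder)
FIELD_W, FIELD_H = 8.08, 4.48  # assumed field size in metres (placeholder)

def normalize_bbox(x, y, w, h):
    """Scale the upper-left corner, width and height of a box to [0, 1]."""
    return x / IMG_W, y / IMG_H, w / IMG_W, h / IMG_H

def normalize_label(px, py):
    """Scale field coordinates in metres to [0, 1]."""
    return px / FIELD_W, py / FIELD_H

def denormalize_prediction(nx, ny):
    """Map a network output in [0, 1] back to field coordinates in metres."""
    return nx * FIELD_W, ny * FIELD_H

features = normalize_bbox(320, 240, 64, 48)
label = normalize_label(2.49, 2.85)
print(features, label, denormalize_prediction(*label))
```

Scaling the labels to [0, 1] is also what makes a Sigmoid output layer a natural fit, since the network can never predict a position outside the field.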

EXPERIMENT AND ANALYSIS
After data preprocessing, the MLP regression model is trained with the PyTorch 1.10 deep learning framework. The number of epochs is set to 1000, the Adam optimizer is used, the learning rate is 0.001 and the loss function is the Smooth L1 loss. The convergence of the loss function is visualized in Figure 7. Next, the model is evaluated on the test set, and the errors in the x-direction and y-direction between the predicted and true values are visualized. To compare the two methods, 400 sets of data are also collected and tested with the mathematical solution algorithm, and the errors of the resulting coordinates are visualized. The final error visualization results of the two methods are shown in Figure 8. From these results, it can be seen that the error peak and error range of the neural network model are significantly smaller than those of the mathematical solution model, which gives a first indication of the superiority and stability of the neural network algorithm. On this basis, the mean errors of the test data in the x and y directions are further compared, and the results are shown in Table 3. The data show that the error of the coordinates obtained by the mathematical solution model is much larger than that of the MLP regression model, which indicates that the prediction performance of the neural network model is better. The MLP learns an end-to-end mapping, so its main source of error is the input data obtained from target detection. The mathematical solution model, in contrast, involves monocular camera calibration, PnP pose estimation, selection of the corners of the visual tags and other steps, so its sources of error are more numerous and its results are correspondingly worse.
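The Smooth L1 loss and the mean error used in this comparison can be written out explicitly. The sketch below follows the usual definition (quadratic for small errors, linear for large ones) with a threshold `beta` of 1.0, which is the common default; the value of `beta` and the sample numbers are assumptions for illustration, not settings or data reported in this paper.

```python
def smooth_l1(pred, target, beta=1.0):
    """Mean Smooth L1 loss over paired predictions and targets."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        if diff < beta:
            total += 0.5 * diff * diff / beta   # quadratic near zero
        else:
            total += diff - 0.5 * beta          # linear for large errors
    return total / len(pred)

def mean_abs_error(pred, target):
    """Mean absolute error along one axis, e.g. the x-direction in metres."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

# Illustrative values only.
print(smooth_l1([0.5, 2.0], [0.0, 0.0]))
print(mean_abs_error([2.47, 2.51], [2.49, 2.49]))
```

Because the loss grows only linearly for large errors, occasional poor bounding boxes from the detector influence training less than they would with a squared-error loss.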
To examine the error caused by the selection of the corners of the visual tags, the coordinates of the four corners of five groups of visual tags are manually selected and mathematically solved, and the experimental results are shown in Table 4. From Table 4, it can be seen that selecting different tag corner points does introduce error into the mathematical solution algorithm, which affects the accuracy of this method. This error is larger than that of the neural network model, which further supports the accuracy and superiority of the neural network model for robot positioning in this experiment.
Finally, the MLP neural network model is deployed with LibTorch, and the coordinates predicted by the model are denormalized back to their original scale. The detection results of the two monocular cameras located at the two ends of the diagonal of the field are fused to obtain the coordinates of the robot. The results are shown in Figure 9. In the view of Figure 9(a), the origin of the site coordinate system is located at the upper left corner of the field. For the Red No.1 robot, the coordinates detected by the camera in the lower left corner of the field (denoted as camera 1, whose image is shown in Figure 9(a)) are (2.47 m, 3.07 m), and the coordinates detected by the camera in the upper right corner of the field (denoted as camera 2, whose image is shown in Figure 9(b)) are (5.57 m, 2.42 m). The detection results of the two cameras are then fused, and the fused coordinates (2.49 m, 2.85 m) are displayed in the simulation map. It can be seen that the position of the Red No.1 robot in the simulated field is almost the same as in the actual field, and the same holds for the other two robots, which verifies the accuracy of the positioning and tracking system based on the neural network. The positioning also performs well during real-time tracking of the robot, which is of great help to the robot in the competition.

CONCLUSION
A monocular visual positioning and tracking system based on a neural network is proposed. First, the lightweight network GhostNet is used to replace the backbone of YOLOv5, and target detection is carried out with the improved GhostNet-YOLOv5 algorithm. Next, an MLP neural network is used to establish a regression model that predicts the coordinates of the robot in the field from the bounding box information. The results of the ICRA AI Challenge in 2021 show that the proposed method is reliable and practical.

AUTHOR BIOGRAPHY
Huijun Li received the Ph.D. degree in engineering from the Institute of Automation, Chinese Academy of Sciences, in 2008. He is an associate professor in the School of Information and Control Engineering, China University of Mining and Technology. He is mainly engaged in research on intelligent robots, machine vision and its applications. He has published more than 10 academic papers in professional journals and has applied for 4 invention patents, 2 utility model patents and 4 software copyrights. He won an Excellent Instructor Award at the RoboMaster 2016 National College Student Robot Competition, where he guided students to win the championship of the eastern division and the first prize in the national finals. He has participated in one 863 sub-project and two National Natural Science Foundation projects, and has undertaken more than 10 projects entrusted by enterprises.
Yu Zhang received the B.S. degree in measurement and control technology and instruments from Yantai University, Shandong, China, in 2016. He is currently a Master's student at China University of Mining and Technology, majoring in control science and engineering. His research interests cover computer vision and autonomous driving perception.