Deep Neural Network–Based Loop Detection for Visual Simultaneous Localization and Mapping Featuring Both Points and Lines

Herein, a visual simultaneous localization and mapping (SLAM) system is proposed in which both points and lines are extracted as features and a deep neural network is adopted for loop detection. Its working principles, including the representation, extraction, description, and matching of lines; initialization; keyframe selection; optimization of tracking and mapping; and loop detection using a deep neural network, are set forth in detail. The overall trajectory estimation and loop detection performance is investigated using the TUM RGB-D (indoor) benchmark and KITTI (outdoor) datasets. The experimental results indicate that, compared with conventional SLAMs, the proposed SLAM improves the accuracy and robustness of trajectory estimation, especially for scenes with insufficient points. As for loop detection, the deep neural network turns out to be superior to the traditional bag-of-words model, because it decreases the accumulated errors of both the estimated trajectory and the reconstructed scenes.


Introduction
Over the past two decades, simultaneous localization and mapping (SLAM) has been gaining ground in applications such as augmented reality [1] and autonomous driving. [2] With SLAM, not only can the trajectory of a moving object be estimated, but the surrounding 3D scene can also be reconstructed in real time. [3][4][5] To date, a variety of SLAMs using various sensors, such as lasers, inertial measurement units, and cameras, have been proposed. [6] Visual SLAM, among others, refers to the type of SLAM that relies on cameras. It can be divided into two subcategories, i.e., feature-based and direct. [7] For example, MonoSLAM, [8] PTAM, [9] and ORB-SLAM [10] are feature-based SLAMs, whereas large-scale direct SLAM (LSD-SLAM) [11] and DTAM [12] are direct SLAMs. Compared to the direct ones, feature-based SLAMs are more flexible, easier to initialize, and faster to run. In addition, they are more robust against image noise and large geometric distortions. Therefore, feature-based SLAMs have been the most widely adopted in recent years. [13][14][15] The entire process of feature-based SLAMs usually involves extraction, description, and matching of features for tracking the camera trajectory; reconstruction and maintenance of the local map; loop detection; and optimization of tracking and mapping over all estimated results.
Feature-based SLAMs that extract points as features are also called "point-based SLAMs." Unfortunately, the performance of point-based SLAMs deteriorates, or even fails, in scenes with insufficient points, because it is difficult to find enough reliable points among the images for matching and tracking. In addition to points, lines are another important feature and can also be used to describe a scene or map. [16][17][18] In most cases, points represent corners, while lines represent edges. Compared to points, lines are more accurate and reliable in situations with occlusion or a moving visual angle, and they are less sensitive to variations in lighting conditions. Therefore, adding lines to current point-based SLAMs could compensate for the lack of points in some scenes. [19][20][21] Furthermore, reconstructed maps comprising both spatial points and lines provide more structural information than those comprising points only.
Loop detection is another crucial process of visual SLAM; it is responsible for recognizing places formerly visited by computing the similarity among images. [22][23][24] It is used to reduce the accumulated error of the estimated trajectories through loop fusion. The traditional method for loop detection is the bag-of-words (BoW) model. [25] This model extracts features from images by feature detection methods and clusters their descriptors into words (using the k-means clustering method) to build a visual dictionary. [26] Then, the corresponding words are used to represent the observed images and to compute the similarity among them. [27][28][29] Finally, the two most similar but discontinuous images are identified as a loop. However, perceptual variability is not handled well, because images from the same scene may change in response to lighting conditions or camera movement, resulting in totally different features. Therefore, in actual applications, BoW may incorrectly detect loops in certain scenes. A wrong detection easily produces a large error in both the position and the orientation of the camera, and such errors can be categorized into two types. [30] One is the false positive, in which two different but similar-looking scenes are treated as a loop. The other is the false negative, in which two images of the same scene that look different are not treated as a loop. A good loop detection method should be able to distinguish between these two situations and return correct detection results.
Due to the limitations of BoW, the accumulated error of visual SLAMs cannot be effectively reduced. To improve the accuracy, a deep neural network (DNN) for loop detection is proposed. [31] The purpose of the DNN is to train a discriminative feature representation from the original sequences and to represent each image in the sequences through the trained features. If a representation is carefully trained from the sequence, it can recognize whether an upcoming observation comes from the same scene or from a different one.
In this study, a visual SLAM is introduced in which both points and lines are exploited to enhance accuracy and robustness. A DNN is used to detect the loops in scenes and consequently reduce the accumulated error of the estimated trajectories and reconstructed scenes. The experiments are conducted on open datasets. The experimental results indicate that the proposed SLAM outperforms the traditional visual SLAMs in terms of the accuracy and robustness of both trajectory estimation and loop detection, especially in scenes with insufficient points.

Representation of Lines
In a 3D space, the representation of lines is more complicated than that of points. A triplet $(x, y, z)^T$ is sufficient to uniquely represent a point. In contrast, lines have four degrees of freedom. Based on the summary by Bartoli and Sturm, [32] several methods have been proposed to represent lines, and two of them are chosen in our solution: the Plücker line coordinates and the orthonormal representation. The former is mainly exploited to initialize newly detected lines and to describe the projection of lines from the 3D space to the 2D image plane, whereas the latter is used for optimization. Their representations are shown in Figure 1. [16]

A 3D line in Plücker coordinates can be represented by two vectors, namely $\mathcal{L} = (\mathbf{n}^T, \mathbf{v}^T)^T$, where $\mathbf{n}$ is the vector normal to the interpretation plane containing the origin and the line, and $\mathbf{v}$ is the direction vector of the line; both are unit vectors. The Plücker coordinates can be constructed from two points $P = [x_1, y_1, z_1, w_1]^T = (\mathbf{p}^T, w_1)^T$ and $Q = [x_2, y_2, z_2, w_2]^T = (\mathbf{q}^T, w_2)^T$ (in homogeneous coordinates) on the line as

$$\mathbf{n} = \mathbf{p} \times \mathbf{q}, \qquad \mathbf{v} = w_1\mathbf{q} - w_2\mathbf{p}$$

Since the vectors $\mathbf{n}$ and $\mathbf{v}$ are perpendicular to each other, there is a constraint, i.e., $\mathbf{n}^T\mathbf{v} = 0$. This constrained, overparameterized form makes the Plücker line coordinates unsuitable for the optimization in visual SLAM. Therefore, another representation, the orthonormal representation $(U, W) \in SO(3) \times SO(2)$, which captures the four degrees of freedom of a 3D line with a minimal parameterization, is chosen for the optimization. When the Plücker coordinates of a line are given, its orthonormal representation can be computed by the QR decomposition [33]

$$[\mathbf{n} \mid \mathbf{v}] = U \begin{bmatrix} \|\mathbf{n}\| & 0 \\ 0 & \|\mathbf{v}\| \\ 0 & 0 \end{bmatrix}, \qquad W = \begin{bmatrix} \omega_1 & -\omega_2 \\ \omega_2 & \omega_1 \end{bmatrix}, \quad (\omega_1, \omega_2) = \frac{(\|\mathbf{n}\|, \|\mathbf{v}\|)}{\sqrt{\|\mathbf{n}\|^2 + \|\mathbf{v}\|^2}}$$

where $U$ and $W$ are 3D and 2D rotation matrices, respectively. Inversely, the orthonormal representation can be converted back to the Plücker coordinates (up to scale) by

$$\mathcal{L} = (\omega_1\mathbf{u}_1^T, \omega_2\mathbf{u}_2^T)^T$$

where $\mathbf{u}_i$ is the $i$th column of $U$, and $\omega_i$ is the corresponding parameter of $W$.
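As a concrete illustration, the construction of Plücker coordinates and the conversion to and from the orthonormal representation can be sketched with NumPy (a minimal sketch; the function names and the sign handling of the QR factors are our own, not part of the original method):

```python
import numpy as np

def plucker_from_points(P, Q):
    """Plücker coordinates (n, v) of the line through two homogeneous points."""
    p, w1 = np.asarray(P[:3], float), float(P[3])
    q, w2 = np.asarray(Q[:3], float), float(Q[3])
    n = np.cross(p, q)        # normal of the plane through the origin and the line
    v = w1 * q - w2 * p       # direction of the line
    return n, v

def orthonormal_from_plucker(n, v):
    """(U, W) in SO(3) x SO(2) via QR decomposition of [n | v]."""
    U, R = np.linalg.qr(np.column_stack([n, v]), mode='complete')
    for i in range(2):                      # fix the QR sign ambiguity
        if R[i, i] < 0:
            U[:, i] = -U[:, i]
    U[:, 2] = np.cross(U[:, 0], U[:, 1])    # ensure det(U) = +1
    s = np.array([np.linalg.norm(n), np.linalg.norm(v)])
    w1, w2 = s / np.linalg.norm(s)
    W = np.array([[w1, -w2], [w2, w1]])
    return U, W

def plucker_from_orthonormal(U, W):
    """Inverse map, up to a global scale: L = (w1*u1, w2*u2)."""
    return W[0, 0] * U[:, 0], W[1, 0] * U[:, 1]
```

A round trip through the two representations recovers the original line up to scale, which is all the optimization requires.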
Because of the limitations of the orthonormal representation in projection and transformation, it is used only in the optimization, whereas the Plücker line coordinates are used to represent the lines in all other processes. When the external parameters, i.e., the pose $(R, t)$ of the camera motion, are determined, the coordinate transformation of lines between the last frame $i$ and the current frame $j$ can be described by the Plücker line coordinates, which is a linear operation, formulated in Equation (5) as

$$\mathcal{L}_j = \begin{bmatrix} R & [t]_\times R \\ 0 & R \end{bmatrix} \mathcal{L}_i \qquad (5)$$

where $[t]_\times$ denotes the skew-symmetric matrix of the translation vector $t$.
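The linear transformation above can be checked numerically: transforming the two points that define a line and rebuilding its Plücker coordinates must agree with applying the $6 \times 6$ line motion matrix directly (a small verification sketch; function names are our own):

```python
import numpy as np

def skew(t):
    """Skew-symmetric matrix [t]x such that skew(t) @ a == cross(t, a)."""
    return np.array([[0., -t[2], t[1]],
                     [t[2], 0., -t[0]],
                     [-t[1], t[0], 0.]])

def transform_plucker(n, v, R, t):
    """Transform a Plücker line (n, v) from frame i to frame j with pose (R, t)."""
    n_j = R @ n + skew(t) @ (R @ v)
    v_j = R @ v
    return n_j, v_j
```

For example, the line through points (0, 0, 1) and (1, 0, 1) has n = (0, 1, 0), v = (1, 0, 0); after a 90° rotation about z and translation (1, 2, 3), transforming the endpoints gives n' = (-4, 0, 1), v' = (0, 1, 0), matching the formula.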

Extraction of Lines
Extracting representative lines from each image of a sequence can substantially reduce the computational complexity of SLAM. A fast line detection method known as line segment detection (LSD), which has only linear O(n) time complexity, is exploited. [34] LSD detects lines through the image gradient with false detection control, and it can automatically detect correct lines without parameter tuning, even in noisy images and complicated scenes. However, LSD suffers from several problems in practical applications. For example, a complete line may be detected as several different lines in the image. Thus, directly employing the original LSD method in SLAM to extract lines may result in matching and tracking failures. To improve the accuracy and robustness of line detection, an optimized LSD method is proposed. [21] If two lines detected by LSD have the same direction and their endpoints are very close (within a specified threshold), they are judged to be the same line, and their endpoints are merged. Consequently, the trajectory estimation error caused by line mismatching can be decreased.
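The merging heuristic can be sketched as follows (an illustrative sketch, not the paper's implementation; the angle and gap thresholds are assumed values):

```python
import numpy as np

# Hypothetical thresholds for the merging heuristic described above.
ANGLE_THRESH = np.deg2rad(3.0)   # maximum direction difference
GAP_THRESH = 5.0                 # maximum endpoint gap in pixels

def merge_collinear(seg_a, seg_b):
    """Merge two LSD segments (each (x1, y1, x2, y2)) if they are nearly
    collinear and their endpoints are close; returns the merged segment,
    or None if they should stay separate."""
    a, b = np.asarray(seg_a, float), np.asarray(seg_b, float)
    da, db = a[2:] - a[:2], b[2:] - b[:2]
    cos_ang = abs(np.dot(da, db)) / (np.linalg.norm(da) * np.linalg.norm(db))
    if np.arccos(np.clip(cos_ang, -1, 1)) > ANGLE_THRESH:
        return None
    # closest pair of endpoints between the two segments
    ends = [a[:2], a[2:], b[:2], b[2:]]
    gap = min(np.linalg.norm(ends[i] - ends[j]) for i in (0, 1) for j in (2, 3))
    if gap > GAP_THRESH:
        return None
    # keep the two farthest endpoints as the merged segment
    pairs = [(i, j) for i in range(4) for j in range(i + 1, 4)]
    i, j = max(pairs, key=lambda p: np.linalg.norm(ends[p[0]] - ends[p[1]]))
    return np.concatenate([ends[i], ends[j]])
```

Two nearly touching horizontal fragments are fused into a single long segment, while perpendicular segments are left untouched.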

Description of Lines
The description of lines is the foundation of line matching. To achieve correct and robust matching for camera pose estimation, the line-based eight-directional histogram feature (LEHF) [35] is chosen to describe the lines detected by LSD; it is based on the mean-standard deviation line descriptor (MSLD) [36] and uses a strategy similar to the scale-invariant feature transform (SIFT). [37] LEHF provides a fast and efficient way to describe detected lines and to match corresponding lines between the 2D image plane and the 3D scene space. It can track lines accurately and robustly even in scenes with local occlusion, rotation changes, image blur, and illumination changes. LEHF descriptors describe the detected lines in terms of gradient histograms of eight directions. First, a certain number of points along the line direction and the direction perpendicular to the line are selected, and their gradient vectors are calculated from differential values in the x and y directions. Then, the gradient histogram of the eight directions is computed by merging these gradient vectors. Finally, rotation invariance is obtained by rectifying the direction of the gradient vectors.
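The core building block, accumulating gradient vectors into an eight-bin orientation histogram, can be sketched as follows (a simplified sketch of an LEHF-style histogram; the magnitude weighting is an assumption, and the full descriptor additionally samples points around the line and rectifies orientations):

```python
import numpy as np

def eight_direction_histogram(grads):
    """Accumulate 2D gradient vectors (N x 2 array) into an 8-bin orientation
    histogram, weighting each vector by its magnitude."""
    mag = np.linalg.norm(grads, axis=1)
    ang = np.arctan2(grads[:, 1], grads[:, 0]) % (2 * np.pi)
    bins = (ang / (2 * np.pi / 8)).astype(int) % 8
    hist = np.zeros(8)
    np.add.at(hist, bins, mag)   # scatter-add magnitudes into their bins
    return hist
```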

Matching of Lines
A sound matching strategy can improve the accuracy, robustness, and latency of SLAM. 2D-3D correspondences are established by matching LEHF descriptors to estimate the current pose of the camera. To match lines between adjacent images, the lines tracked in the 3D scene space in the previous frame are projected onto the current 2D image plane, and matches are sought between the descriptors of the projected lines and those of the observed lines. According to a reprojection error threshold calculated with random sample consensus (RANSAC), each 3D line projected onto the image plane is labeled with one of several states: inlier, outlier, or out of the image plane. [35] In our solution, only the inliers are considered valid lines, and the other lines are discarded. The line matching process is shown in Figure 2.
To reduce the computational complexity, the matching range of each line is restricted, and only the lines observed within the threshold, represented by the red dashed lines in Figure 2, are selected to calculate the correlative descriptors. The LEHF distances of these 2D lines are computed, and the line matched with the 2D line corresponding to the 3D line is the one with the minimum Euclidean distance between LEHF descriptors.
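The minimum-distance matching step can be sketched as follows (a minimal sketch; the rejection threshold `max_dist` is an assumed parameter, and in the full pipeline the candidate set would already be restricted to lines within the matching range):

```python
import numpy as np

def match_lines(proj_desc, obs_desc, max_dist=0.5):
    """For each projected 3D line descriptor (rows of proj_desc), find the
    observed 2D line descriptor with the minimum Euclidean distance;
    reject matches whose distance exceeds max_dist."""
    matches = []
    for i, d in enumerate(proj_desc):
        dists = np.linalg.norm(obs_desc - d, axis=1)
        j = int(np.argmin(dists))
        if dists[j] <= max_dist:
            matches.append((i, j))
    return matches
```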

Deep Neural Network for Loop Detection
The proposed DNN for loop detection can be decomposed into two parts: 1) image feature training, in which an autoencoder is used to train the representative features of images, and 2) loop detection, in which loops are detected by using a similarity or difference matrix.

Feature Training
In the proposed DNN-based loop detection, a stacked autoencoder [38] is adopted to train features from original images and to represent each image using the first hidden layer of the network. The process of training image features is shown in Figure 3.
Before training, the original images are divided into 10 × 10 patches. After being vectorized, the patches form vectors {x | x ∈ R^100} and are fed into the autoencoder to train the representative features. The autoencoder is an unsupervised training method for a DNN. It mainly contains three layers: 1) the input layer x, which directly connects to the vectorized original images; 2) the hidden layer h; and 3) the output layer y, which is discarded after the feature training process is completed. According to the definition of the autoencoder, the output layer should recover the input as closely as possible.

Figure 2. Matching process of lines of the adjacent images: 1) projecting the tracked 3D lines in the previous frame to the current 2D image plane and 2) finding the matched lines between the descriptors of projected lines and observed lines within the threshold, which are represented by the red dashed lines, and are used to compute the LEHF descriptors and to match with the descriptor of the projected line.
In each unit of a layer, a response a is computed from the former layer with a weight vector w and a bias b through the activation function. The sigmoid activation function

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

is selected to increase the nonlinearity of the network. The process of feature training is modeled as an optimization problem whose loss function d is the distance between the input value x and the output value y. The trained parameters (weight vector w and bias b) are updated iteratively to obtain the optimal network parameters,

$$w \leftarrow w - \eta \frac{\partial d}{\partial w}, \qquad b \leftarrow b - \eta \frac{\partial d}{\partial b} \qquad (7)$$

where the parameter η in Equation (7) is the learning rate, and, for inputs x ∈ (0, 1), the loss d is given by the cross-entropy

$$d(x, y) = -\sum_k \left[ x_k \log y_k + (1 - x_k) \log(1 - y_k) \right]$$

To obtain more meaningful features for loop detection, the loss function of the autoencoder is extended in three additional aspects: 1) denoising: as the images are usually affected by undesired noise, a corruption is added to the input x, which randomly masks a certain percentage of the input to zero; 2) sparsity: because the dimension of the hidden layer h is larger than that of the input x, a sparse structure is trained to avoid overfitting; and 3) continuity: because the camera moves continuously, consecutive images of a sequence should produce similar features. The denoising and sparsity terms are common to autoencoders, whereas the continuity term is unique to SLAM. These three additional penalty items are added to the original autoencoder, and the corresponding loss function is represented by

$$\mathcal{L} = d\left(x, y(\tilde{x})\right) + \lambda \sum_j \left| h_j \right| + \gamma \left\| h(x_t) - h(x_{t-1}) \right\|^2$$

in which x̃ is the input corrupted for denoising, h_j is the response of the jth hidden unit, and λ and γ are penalty weights defined to balance the effects of the sparsity and continuity terms; both are hyperparameters of the DNN.
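The structure of this loss can be sketched in a few lines of NumPy (an illustrative sketch only: the layer shapes, penalty forms, and weights here are assumptions, and no gradient step is shown):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def corrupt(x, level, rng):
    """Denoising: randomly mask a fraction `level` of the inputs to zero."""
    mask = rng.random(x.shape) >= level
    return x * mask

def loss(x, x_prev, W, b, W_out, b_out, lam=1e-3, gamma=1e-3, level=0.12, rng=rng):
    """Reconstruction loss with sparsity and continuity penalties."""
    x_tilde = corrupt(x, level, rng)
    h = sigmoid(W @ x_tilde + b)                 # hidden response (trained feature)
    y = sigmoid(W_out @ h + b_out)               # reconstruction of the input
    h_prev = sigmoid(W @ x_prev + b)             # feature of the previous image
    recon = np.sum((x - y) ** 2)                 # reconstruction error
    sparsity = lam * np.sum(np.abs(h))           # sparsity penalty
    continuity = gamma * np.sum((h - h_prev) ** 2)  # continuity penalty
    return recon + sparsity + continuity
```

A squared reconstruction error is used here in place of the cross-entropy for brevity; either choice leaves the roles of the three terms unchanged.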

Loop Detection
After training the DNN, the trained features are collected to construct a representation of the scenes. The features are obtained from the response of the hidden layer after discarding the output layer of the network. In this way, the images are represented in another form, one that is difficult to interpret intuitively but can be recognized easily and effectively by the computer. According to the principle that similar inputs produce similar features, for any two images m and n in the trajectory of the camera movement, the difference matrix can be computed from

$$D(m, n) = \left\| z \odot \left( h_m - h_n \right) \right\|_1$$

where D is the difference matrix, $h_m$ and $h_n$ are the hidden-layer responses of the two images, and z is a weight vector that downweights hidden units with a high average response over the sequence. According to the properties of the difference matrix, a more distinct matrix can be obtained by normalization, and its rank can be reduced through the singular value decomposition $D = U \Sigma V^T$. The $k - 1$ largest singular values are discarded, which reduces the effect of scene ambiguity. The matrix used to detect the loops is formulated as

$$D_R = D - \sum_{i=1}^{k-1} \sigma_i \mathbf{u}_i \mathbf{v}_i^T$$

Figure 3. Process of training the image features. The deep neural network, the autoencoder, is adopted to train the features of images. Before training, the original images are divided into 10 × 10 patches. After being vectorized, the patches form vectors {x | x ∈ R^100} and are fed into the autoencoder to train the representative features of images.

If $D_R(m, n)$ is large, the two scenes are considered dissimilar. Conversely, if $D_R(m, n)$ is small, the two scenes are considered similar.
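The difference-matrix computation and the rank reduction can be sketched as follows (an illustrative sketch; the inverse-mean weighting is one plausible realization of the downweighting described above, not necessarily the paper's exact formula):

```python
import numpy as np

def difference_matrix(H, eps=1e-6):
    """H: (T, J) matrix of hidden-layer responses for T images.  Weight each
    unit inversely to its average response over the sequence, then compute
    all pairwise weighted L1 differences."""
    z = 1.0 / (H.mean(axis=0) + eps)
    z /= z.sum()                                   # normalize the weights
    diff = np.abs(H[:, None, :] - H[None, :, :])   # (T, T, J) pairwise diffs
    return (diff * z).sum(axis=2)

def reduce_rank(D, k):
    """Remove the k-1 largest singular components to suppress scene ambiguity."""
    U, s, Vt = np.linalg.svd(D)
    s[:k - 1] = 0.0
    return U @ np.diag(s) @ Vt
```

Identical feature rows yield a zero entry (a loop candidate), while dissimilar rows yield a large one.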

Architecture
A visual SLAM based on both points and lines is proposed to estimate the trajectory of the camera and to reconstruct unknown scenes, and the DNN-based loop detection is adopted to recognize previously visited scenes. The point-based ORB-SLAM [10] is used as the basic framework, and its main parallel threads are retained, on top of which the expanding strategies for points, lines, and loop detection are implemented. Figure 4 outlines the architecture of the proposed SLAM, which consists of four main threads. Thread 1. TRACKING: through this thread, the extraction and tracking of points and lines are completed; the camera's localization is obtained and the local map is tracked; and whether the current frame is a keyframe is determined. This thread exploits optimization to minimize the reprojection error, and only the camera's pose in the local map is used as the state variable. Thread 2. LOCAL MAPPING: the local maps and keyframes are managed and optimized, including keyframe insertion and culling, and feature creation and culling. The 3D spatial features, i.e., points and lines, are assigned as optimization variables. Thread 3. LOOP DETECTION: loops are detected by the proposed DNN, and the accumulated error is corrected through loop fusion. Thread 4. GLOBAL OPTIMIZATION: the purpose of this thread is to reduce the accumulated errors so as to correct the camera trajectory. Figure 5 shows the computing process of the proposed SLAM. Two independent parallel threads, for points and lines respectively, are designed to make reasonable use of the computing resources of the platform and thereby decrease the computational latency. After extraction and description, the points and lines are used to carry out feature matching, which is followed by camera pose estimation and scene reconstruction.

Initialization
Initialization is a procedure of the tracking thread; its purpose is to compute the relative camera pose between two images and to retrieve an initial map composed of points and lines. During initialization, the points and lines in the 2D image planes are projected back into the 3D space by their geometric relationships.
To reduce the computational complexity and increase the efficiency, points are preferentially used for initialization. When the point-based initialization fails, lines step in to restart the initialization. If there are not enough valid points and lines, the whole initialization process fails and the reference frame needs to be reset. The entire initialization process is plotted in Figure 6, in which the green line represents the initialization process with points and the yellow one represents the initialization process with lines. The point-based initialization is shown in Figure 7, where two scenes need to be considered: when using points, a planar scene is modeled by a homography matrix and a nonplanar scene by a fundamental matrix. The homography matrix $H_{cr}$ and the fundamental matrix $F_{cr}$ are calculated in parallel as

$$x_c = H_{cr} x_r, \qquad x_c^T F_{cr} x_r = 0$$

where $x_c$ and $x_r$ are the matching points in the current frame and the reference frame, respectively. The score $S_M$ for a model M (where M denotes the homography matrix H or the fundamental matrix F) is computed by

$$S_M = \sum_i \left( \rho_M\!\left(d_{cr}^2(x_c^i, x_r^i)\right) + \rho_M\!\left(d_{rc}^2(x_c^i, x_r^i)\right) \right)$$

where $d_{cr}^2$ and $d_{rc}^2$ are the symmetric transfer errors from one frame to the other.

Figure 6. Initialization process with points or lines. The green line represents the initialization with points, which is prioritized, while the orange one represents the initialization with lines. If there are not enough points and lines in the frame, the initialization fails and the reference frame needs to be reset.
Figure 7. Initialization by points from two consecutive images. Two scenes need to be considered when using points: a planar scene is assumed by a homography matrix and a nonplanar scene by a fundamental matrix. A robust heuristic parameter computed from these two models is used to judge which matrix is used to compute the camera's pose R and t.
where $\rho_M$ is

$$\rho_M(d^2) = \begin{cases} \Gamma - d^2, & d^2 < T_M \\ 0, & d^2 \geq T_M \end{cases}$$

with $T_M$ the outlier rejection threshold of model M. A robust heuristic parameter $R_H$, defined as

$$R_H = \frac{S_H}{S_H + S_F}$$

is used to select the specific model. For example, $R_H > 0.45$ indicates that the captured scene is planar or has low parallax, and the homography matrix is selected to complete the initialization. Otherwise, the fundamental matrix is chosen. The initialization process of lines is shown in Figure 8. Three consecutive images containing the same line are used for initialization. The rotation between consecutive camera poses is assumed to be small and continuous; the three camera rotations are denoted $R_1 = R^T$, $R_2 = I$, and $R_3 = R$. Under this assumption, the three observations of the same line in the consecutive frames of Figure 8 impose a constraint on R. In addition, for small rotations, R can be approximated by the first-order form

$$R \approx I + [\boldsymbol{\theta}]_\times$$

where $[\boldsymbol{\theta}]_\times$ is the skew-symmetric matrix of the rotation vector. The transformation relationship can be computed from the total number of matched features in these three consecutive images, and then each entry of R can be obtained.
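The model-selection heuristic can be sketched as follows (an illustrative sketch; the thresholds follow the chi-squared-style cutoffs used in ORB-SLAM-like systems, but the exact values here are assumptions):

```python
import numpy as np

# Assumed outlier cutoffs and score offset for the two models.
T_H, T_F, GAMMA = 5.99, 3.84, 5.99

def score(d2_cr, d2_rc, T_M):
    """Score S_M: sum of (GAMMA - d^2) over inlier symmetric transfer errors."""
    s = 0.0
    for d2 in np.concatenate([d2_cr, d2_rc]):
        if d2 < T_M:
            s += GAMMA - d2
    return s

def select_model(d2_h_cr, d2_h_rc, d2_f_cr, d2_f_rc):
    """Choose homography vs fundamental matrix via the heuristic R_H."""
    S_H = score(d2_h_cr, d2_h_rc, T_H)
    S_F = score(d2_f_cr, d2_f_rc, T_F)
    R_H = S_H / (S_H + S_F)
    return ("homography" if R_H > 0.45 else "fundamental"), R_H
```

When the homography explains the matches well (small transfer errors) and the fundamental matrix does not, $R_H$ approaches 1 and the planar model is chosen.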

Keyframe Selection
When the current image no longer matches the latest keyframe, a new keyframe should be selected and inserted into the local map. Since ORB-SLAM only focuses on points, several changes are made according to the structural properties of the lines. A scale factor α [39] is added to describe the ratio of the motion estimation entropy between the previous keyframe i − u and the current frame i to the entropy between the previous keyframe i − u and its consecutive frame i − u + 1, which is written as

$$\alpha = \frac{H(\xi_{i-u,\,i})}{H(\xi_{i-u,\,i-u+1})}$$

where ξ represents the Lie-algebra parameterization of the relative pose, [40] and the entropy of a Gaussian pose estimate with covariance Σ is given by

$$H(\xi) = \frac{n}{2}\left(1 + \ln 2\pi\right) + \frac{1}{2}\ln\left|\Sigma\right|$$

If the value of α is lower than the preestablished threshold, which is set to 0.9, the current frame is inserted into the map as a new keyframe, i.e., as the input of the local mapping.
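The entropy-ratio criterion can be sketched as follows (an illustrative sketch under the Gaussian assumption above; the 6 × 6 covariances and the sign conventions of the entropies are assumptions of this sketch):

```python
import numpy as np

def pose_entropy(cov):
    """Differential entropy of a Gaussian pose estimate with covariance cov
    (6x6 for a Lie-algebra pose parameterization)."""
    n = cov.shape[0]
    return 0.5 * n * np.log(2 * np.pi * np.e) + 0.5 * np.log(np.linalg.det(cov))

def is_new_keyframe(cov_kf_to_cur, cov_kf_to_next, alpha_thresh=0.9):
    """Insert a keyframe when the entropy ratio alpha drops below the threshold."""
    alpha = pose_entropy(cov_kf_to_cur) / pose_entropy(cov_kf_to_next)
    return alpha < alpha_thresh, alpha
```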

Graph Optimization
Graph optimization is used to estimate each state, including the trajectory of the camera and the localization of the detected points and lines in the scenes. The graph consists of vertices and edges: the vertices represent the variables that need to be optimized, while the edges represent the corresponding relations among these optimized variables. Figure 9 is an illustration of the graph optimization. [41] The circles are the landmarks observed by the camera, and they constitute the vertices of the graph along with the camera poses. The blue lines are the motion model of the camera, and the red dashed lines represent the observation model of each image; these two kinds of lines are the optimized edges of the graph. The detected landmarks consist of both points and lines, and the edges involve the reprojection errors of the tracked points and lines in the scenes. As shown in Figure 10, an observation model is built to describe the reprojection error of the points and lines. For points, the camera pose is denoted $T_{cw} = (R_{cw}, t_{cw})$, and the intrinsic parameter matrix of the camera is denoted K. A 3D point X in the world coordinate system is transformed into the camera coordinate system as $X_c$, and then the point $x'$ in the image coordinate system is computed according to the intrinsic matrix K. Finally, the error between the matched point x and the projected point $x'$ is calculated. The whole process is mathematically expressed as

$$X_c = R_{cw} X + t_{cw}, \qquad x' = \pi(K X_c)$$

where π denotes the dehomogenization of image coordinates. The reprojection error of a line is obtained by a similar process: a space line L is projected into the image plane as $l'$, and the reprojection error is measured by the distances of the two endpoints $x_s$ and $x_e$ of the matched image line to $l'$. Ideally, the projected points and lines coincide with the observed points and lines; however, due to the errors of both projection and observation, they are not completely coincident.

For this reason, both errors are treated as edges between the vertices in the graph optimization. These two edges are determined by

$$e_p = x - \pi\!\left(K\left(R_{cw} X + t_{cw}\right)\right)$$

and

$$e_l = \left[ d(x_s, l'),\; d(x_e, l') \right]^T$$

where $d(\cdot, l')$ is the point-to-line distance in the image. Assuming that the errors satisfy Gaussian distributions with covariance matrices $\Sigma_p$ and $\Sigma_l$ for points and lines, respectively, the final objective function C to be optimized is the sum of all the point and line reprojection errors, i.e.,

$$C = \sum \rho_p\!\left(e_p^T \Sigma_p^{-1} e_p\right) + \sum \rho_l\!\left(e_l^T \Sigma_l^{-1} e_l\right)$$

where $\rho_p$ and $\rho_l$ are the Huber cost functions, used to suppress outliers in the objective function and to keep the observed features as close to the projected features as possible.
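The two residuals and the robust cost can be sketched as follows (a minimal sketch of the error terms only; a real system would feed these residuals to a nonlinear least-squares solver such as g2o or Ceres):

```python
import numpy as np

def project_point(K, R, t, Xw):
    """Pinhole projection of a world point into pixel coordinates."""
    Xc = R @ Xw + t
    uvw = K @ Xc
    return uvw[:2] / uvw[2]          # dehomogenization pi(.)

def point_error(K, R, t, Xw, x_obs):
    """Point reprojection residual e_p."""
    return x_obs - project_point(K, R, t, Xw)

def line_error(l_img, xs, xe):
    """Line reprojection residual e_l: signed distances of the observed
    endpoints xs, xe to the projected image line l_img = (a, b, c),
    representing ax + by + c = 0."""
    l = l_img / np.linalg.norm(l_img[:2])
    return np.array([l[:2] @ xs + l[2], l[:2] @ xe + l[2]])

def huber(r2, delta=1.0):
    """Huber cost applied to a squared residual r2."""
    r = np.sqrt(r2)
    return r2 if r <= delta else 2 * delta * r - delta ** 2
```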

Results and Discussion
Our experiments are composed of two parts: the first part evaluates the accuracy of the DNN-based loop detection in the proposed SLAM, and the second part demonstrates the accuracy and robustness of the trajectory estimation. The experiments are conducted on the TUM RGB-D (indoor) benchmark [42] and the KITTI (outdoor) datasets. [43]

Loop Detection
The accuracy of the DNN-based loop detection in the proposed SLAM is validated through a set of experiments. TensorFlow, an open-source software library for deep learning, [44] is used to build the autoencoder, and the datasets come from the TUM RGB-D benchmark, which contains both the RGB-D images and their ground truth trajectories. Each sequence of images is randomly sampled: 30% of the images are used for training, while 20% are used for testing the ability of the network to represent other images. To obtain accurate and robust features from the trained network, several hyperparameters related to the DNN are used in the process of training visual features; they are itemized in Table 1.
The autoencoder comprises a total of three hidden layers, with 200, 50, and 50 units, respectively. First, the influence of some hyperparameters on the performance of the neural network is evaluated. The corruption level, which corresponds to the denoising, affects the weight matrix W of the first hidden layer. Since noise always exists in the images, adding the corruption is indispensable for increasing the robustness and preventing the network from overfitting.
In the absence of corruption, namely, when the corruption level is 0, the meaningful information in the weight matrix of the hidden layer is clouded by noise, as shown in Figure 11a. When corruption is present, as shown in Figure 11b-d, sparse edge features of the images can be observed. Among the different corruption levels, the optimal level is found to be 12%. Figure 12 shows the average response of the images in the hidden units. The impact of adding the corresponding penalties is judged from the perspective of sparsity. With a sparsity penalty in the loss function, most of the useful information in the trained results is preserved, and the redundant features in the network, which are irrelevant to the current training task, are discarded. Adding the sparsity penalty can prevent the network from overfitting and eliminate redundant features. In addition, it can reduce the computational complexity of training the network parameters and make the network more interpretable.

Figure 11. Visualization of the structures trained (10 × 10 units in the result are selected) from the hidden layer with four different corruption levels: a) 0%, b) 5%, c) 12%, and d) 30%. Experiments show that although some details of the images are slightly lost at the 12% level, it nevertheless obtains a useful feature representation.

The difference matrix computed from the first hidden layer response is used for detecting the loops, and a value lower than the threshold d indicates that a loop candidate has been detected. The similarity matrix, which is calculated from points and lines, is used to fuse the loops between these candidate frames and the current frames. Figure 13 shows the comparison results of the feature graphs and their differences in two scenes. In the same scene, the differences of the feature representations are very small, whereas in different scenes, the feature differences are relatively large.
To evaluate the performance of the trained features for loop detection, the precision-recall curve is calculated. The precision-recall curves of the two competing methods on the TUM RGB-D benchmark are shown in Figure 14. It can be seen that the performance of the DNN is better than that of the BoW. In practical scenes, once a loop is detected, its neighboring frames will also be detected with the current frame; therefore, a recall rate approaching 50% is sufficient. When the recall equals 50%, the precision of the DNN-based loop detection is about 10% higher than that of the traditional BoW. The sequence fr2/pioneer_slam in the TUM RGB-D benchmark is used to evaluate the performance of the loop detection. The estimated trajectory and the ground truth are compared in Figure 15. It can be seen that the proposed method accurately detects the loops of the scene, and the estimated trajectory agrees well with the ground truth.
The time required to detect loops is another crucial indicator. Computed on an Intel Core i7-6600U CPU at 2.60 GHz, the loop detection times of the BoW and the DNN are compared in Table 2, where the detection time is the average time for detecting loops in the fr2/pioneer_slam sequence of 2198 images. The detection process compares the current frame with all the keyframes maintained in the reconstructed map. The time of the BoW method is 10.24 ms, whereas that of the DNN is 14.76 ms, only 4.52 ms longer than the BoW.

Localization
Localization is evaluated in terms of the accuracy and robustness of trajectory estimation. Based on the indoor TUM RGB-D benchmark, the proposed SLAM is compared with other visual SLAMs, including PTAM, LSD-SLAM, and ORB-SLAM, to evaluate their localization accuracies. The image sequences in the TUM RGB-D benchmark that are not suitable for monocular visual SLAM are removed, and consequently only ten relevant sequences are selected. The absolute median RMS error is used as the comparison metric; it can be computed with a Python script provided by the benchmark. Before calculating the error, a similarity transformation is applied to all the trajectories to align the data. To reduce the amount of calculation, only the keyframes extracted from each image sequence are used. Table 3 summarizes the absolute median RMS errors of the camera trajectories, executed five times in each sequence. In all ten indoor scenes, the proposed SLAM exhibits higher trajectory accuracy than the others. In particular, in the floor and office scenes containing abundant lines, the localization accuracy of the proposed method is much higher than that of the point-based ORB-SLAM. By contrast, PTAM is unreliable in several scenes, failing in six out of the ten sequences. In addition, LSD-SLAM, a direct method, also loses its trajectory in two scenes, and its localization accuracy is much lower than that of our SLAM. As for the outdoor performance, KITTI, an outdoor dataset with ground truth, is used for the experiments. It contains 22 sequences collected from outdoor autonomous driving scenes.
To highlight the performance of the SLAM after adding lines to the features, four representative sequences with sufficient lines, 01, 03, 07, and 09, are picked as test sequences. The trajectories obtained by the proposed SLAM, ORB-SLAM, and the points-and-lines-based stereo visual odometry (PLSVO) [20] are compared with the ground truth, as shown in Figure 16. The trajectories of our SLAM overlap better with the ground truth than those of ORB-SLAM and PLSVO. Furthermore, the proposed SLAM exhibits good robustness of loop detection at the same time. As can be seen from Figure 16a, our SLAM works well in sequence-01 (highway), whereas ORB-SLAM fails, because the number of points appearing in the highway scene is insufficient. By contrast, with lines added, the number of features available to our SLAM for matching and tracking increases substantially. Especially in the straight-line driving stage, the estimated trajectory of our SLAM agrees well with the ground truth, since there is an adequate number of lines. In addition, PLSVO, which adopts brute-force matching between the extracted features in the visual odometry, suffers greatly from accumulated errors due to the absence of loop detection. Though able to work in sequence-01, the trajectories of PLSVO deviate far from the ground truth as the distance increases. Table 4 lists the absolute median RMS errors of the keyframe trajectories in each sequence, from which it can be seen that the proposed SLAM shows better accuracy and robustness of localization than the traditional point-based ORB-SLAM.

Conclusions
In this article, a visual SLAM is proposed in which both points and lines are extracted as features to estimate the trajectory of the camera and reconstruct the map of the scene in real time. In addition, a deep neural network, an autoencoder, is adopted for the loop detection. The overall performance regarding trajectory estimation and loop detection has been investigated using the TUM RGB-D (indoor) benchmark and KITTI (outdoor) datasets.
Our experimental results indicate that the proposed SLAM shows better accuracy and robustness of trajectory estimation than the traditional point-based or direct SLAMs, especially for scenes with insufficient points. As for the loop detection, the DNN turns out to be superior to the traditional BoW, as it decreases the accumulated error of the estimated trajectories and reconstructed scenes. Merits aside, we shall also mention that our SLAM can underperform in the presence of noise and poorly defined geometries, and can even fail if both points and lines are insufficient.