Tracking‐DOSeqSLAM: A dynamic sequence‐based visual place recognition paradigm

Konstantinos A. Tsintotas, Laboratory of Robotics and Automation, Department of Production and Management Engineering, School of Engineering, Democritus University of Thrace, Xanthi, Greece. Email: ktsintot@pme.duth.gr

Abstract

Simultaneous localization and mapping (SLAM) refers to a process that permits a mobile robot to build up a map of the environment and, at the same time, to use it to compute its location. One of its most important components is the ability to associate the most recently perceived visual measurement with the one derived from previsited locations, a technique widely known as loop closure detection. In this article, we evolve our previous approach, dubbed 'DOSeqSLAM', by presenting a low-complexity loop closure detection pipeline wherein the traversed trajectory (map) is represented by sequence-based locations (submaps). Each of these groups of images, referred to as a place, is generated online through a point tracking repeatability check employed on the perceived visual sensory information. When querying the database, the proper candidate place is selected and, through an image-to-image search, the appropriate location is chosen. The method is subjected to an extensive evaluation on seven publicly available datasets, revealing a substantial improvement in computational complexity and performance over its predecessors, while performing favourably against other state-of-the-art solutions. The system's effectiveness is owed to the reduced number of places, which, compared to the original approach, is at least one order of magnitude less.


| INTRODUCTION
Nowadays, robotics researchers have put tremendous effort into developing methods to map the world through several exteroceptive sensors [1][2][3]; the reason is that an appropriate representation of the surroundings enables the robot to perform more elaborate tasks, such as path and task planning. Typically, though, the use case the robot has to deal with dictates the map representation. Within the scope of simultaneous localization and mapping (SLAM) methods [4], the robot has to estimate its pose as it navigates through the working field. An efficient and robust estimation is vital for accurate navigation to be achieved. Thus, SLAM is a sine qua non in any contemporary autonomous system. The ability to detect and identify a location that has previously been observed is referred to as place recognition. Due to noisy sensor measurements or field abnormalities, drifts occur in the robot's generated map. Such cases are minimised, and improved pose estimation is provided, through the accurate detection of loop closures [5][6][7][8][9][10]. In many contemporary applications, such as aerial or space robotics, computational resources are restricted. In such cases, efficient methods that provide low complexity, even at the expense of performance, are generally preferred [11][12][13][14][15].
Visual place recognition, the ability to identify a known location in the environment using vision as the main sensory input, is achieved using cameras that provide low-cost means for the generation of extremely rich and dense data [16]. Owing to the increased availability of computational power during the last years, cameras became the primary sensory unit in most autonomous mobile platforms where localisation is based solely on the appearance of the scene [17]. The key components that constitute such a pipeline are the image processing module, the map and the belief generator module. The first processes the incoming visual data; during the agent's navigation it formulates the environment's map, which maintains the robot's knowledge about the explored world. Finally, the last module decides whether or not the agent re-encounters an already visited location. Depending on the way by which the system maps the environment, appearance-based systems are distinguished into two main categories, namely single- and sequence-based. Approaches belonging to the first category greedily seek the most similar location in the robot's traversed trajectory, while methods of the second category search between submaps, that is, groups of individual frames defined as places. In both categories, the belief generator module performs data comparisons, aiming to obtain an image similarity score based on the way the incoming images are processed. Sum of absolute differences (SAD), local feature vote density [18], bag of words (BoW) histograms [19] and feature vectors derived from convolutional neural networks (CNNs) [20] are the most common techniques. Thus, loop closure events are indicated by comparing similarities between such map representations.
SeqSLAM [21] constitutes one of the most recognised algorithms in sequence-based visual place recognition [22][23][24][25][26][27][28][29][30][31], demonstrating the performance improvement obtained by comparing places to decide the system's position in the world. Using a downsampling scheme within the image processing module, along with the SAD metric, this framework achieves robust localisation through a nearest-neighbour distance ratio [32] technique. However, many challenges arise when breaking the map into places, including optimal submap size, submap overlap during database searching, consistent semantic map segmentation, data duplication and submap alignment [33].
To overcome these challenges, the majority of approaches, including SeqSLAM, use a predefined number of frames to break the map into places. Subsequently, through a sliding window scheme, these approaches look for every possible submap correlation. Although this technique improves the achieved performance, its functionality is computationally costly since the agent needs to seek and compare every possible group of images. As a result, the system's complexity increases, since comparisons are performed for images which might not exhibit the same semantics as their neighbouring ones. Having identified this drawback, in our previous work [34] we proposed an online sequence-based loop closure detection pipeline, wherein a feature matching technique between consecutive images provides the system with dynamically defined places. This approach searches the traversed trajectory for similar groups of images, avoiding the sliding window approach of the initial framework, while its belief module is based on an average decision filter. The usage of dynamic submaps showed favourable performance against its predecessor; nevertheless, the extraction and matching of local floating-point features for each incoming camera measurement burdened the computational complexity. To build an efficient system that is independent of any training procedure, we evolve our preliminary framework by presenting an appearance-based loop closure detection pipeline which relies on a dynamic place-to-place matching scheme. Local point extraction is performed on the perceived visual information and, through the Kanade-Lucas-Tomasi (KLT) tracker [35], a new place, that is, a group of images, is defined when the contained points' tracking fails to advance to the following frame. This way, the computationally costly local feature extraction and matching of DOSeqSLAM is avoided, while robustness is achieved regarding the places' size.
Next, the camera measurements are subjected to the image processing module, used by both the initial approach and our previous work, where instances are downsampled and normalised. At query time, the latest constructed place searches for the most similar candidate through the matching score produced by the nearest-neighbour distance ratio. Finally, the most suitable location is identified via an image-to-image association in the SAD domain, avoiding the usage of the average decision filter of our preliminary work. The proposed framework is evaluated in seven different environments and compared with its predecessors and other state-of-the-art solutions.
The main contributions of this work are as follows:
- A low-computational place recognition pipeline capable of detecting loop closure events through place-to-place comparisons, using almost two orders of magnitude fewer operations than the initial approach.
- A robust dynamic submapping for place definition based on the extent of the point tracking. Although this process could also be used to detect the dataset's keyframes, we define a place by all the frames belonging to the same submap, as outlined by the point tracking.
- An exhaustive experimental parameter evaluation scheme based on the number of extracted points.
The remainder of this article is organised as follows. Section 2 provides a brief review of appearance-based place recognition approaches. In Section 3, the proposed pipeline is described in detail, while Section 4 presents evaluation and experimental results. In Section 5, the conclusions are provided.

| RELATED WORK
The loop closure detection pipeline proposed in [19] belongs to single-based methods and makes use of the BoW model [36] for representing the incoming image through a pretrained visual vocabulary generated by SIFT descriptors [32]. Additionally, a Chow-Liu tree learns the co-occurrence probabilities among visual words [37]. An improved approximation of this work allows the system to scale by more than two orders of magnitude [38], while 3D information further enhances the system [39]. In [40,41], a binary vocabulary is proposed and utilised, which is accompanied by a geometrical verification step to enhance loop detection. To improve the speed of the matching process in methods based on binary vocabularies, an incremental feature-based tree is proposed in [42].
The latest works in visual loop closure detection are inspired by the great success of CNNs in several computer vision tasks [43][44][45]. These approaches address the place recognition task by using specific layers of the architecture to represent an image and determine potentially revisited locations [46][47][48][49][50][51][52][53][54][55][56]. These layers are originally trained for object recognition; thus, they are tightly bound to their learning example attributes. NetVLAD [49], an advanced version of VLAD (vector of locally aggregated descriptors [57]), which is commonly used for image retrieval, consists of two trainable end-to-end modules. The first is a CNN extracting the image features and the second is a fusion layer that forms a descriptor mimicking the behaviour of VLAD. Improving upon CNN-based visual place recognition, Chen et al. [51] trained two neural networks on the Specific Places Dataset (SPED), namely AMOSNet and HybridNet. The former is trained from scratch on SPED, while the latter uses the weights of the top five convolutional layers from CaffeNet [58], which is trained on the ImageNet dataset [59]. CNN-based descriptions of images that utilise only regions of interest (ROI) showed enhanced performance compared to whole-image descriptors. R-MAC (regions of maximum activated convolutions) [60] uses max pooling on cropped areas in CNN layer features to extract ROI. Khaliq et al. [53] combine VLAD with ROI to achieve robustness against appearance and viewpoint variations. However, CNNs require model training on large-scale labelled datasets from a multitude of environments, which is a practical limitation. Their computationally intense nature is another key drawback, since higher run-time memory and longer feature encoding times are needed. Thus, despite the impressive results they produce, their high demand in computational resources makes these frameworks unsuitable for mobile robotic applications [61,62]. More specifically, the above limitations raise deployability concerns on resource-constrained platforms (including battery-powered aerial, micro-aerial and ground vehicles), as identified in [63].
Furthermore, their feature extractor is viewpoint dependent since the topological information is not provided. As a result, they remain incompatible with the majority of SLAM applications in mobile robotics without the utilisation of extra computational power.
While the aforementioned methods address the place recognition task as a single-instance matching process, sequence-based matching frameworks aim to take advantage of the additional information provided by a group of images in a scene. In [64], a sequence-based algorithm is proposed where the distance between local scenes is used in order to find statistical pairings between places. Similarly, in [65], the incoming visual sensory information is segmented into fixed-size groups of images, each represented by a common visual-word histogram. Using a quantitative interpretation of temporal consistency, place-to-place matches that are coherently advancing along time are enhanced [66]. Groups of landmarks, formulated through local feature covisibility, generate location graph models [67], while in [68], additional geometrical information from the observed environment structure is used in order to increase the performance. Vysotska et al. [69] build up a data association graph exploiting GPS information to find a sequence of matching images in an offline fashion. An extension of this work exploits hashing techniques to realise efficient re-localisation when the platform has left the previously mapped area [70].
Compared with the methods which are based on a pretrained description technique, incremental approaches 'learn' the environment during the navigation. In [71], two visual vocabularies, one representing image descriptors and the other colour histograms, are generated online, intending to detect loop closures in a Bayesian filtering scheme. Following the incremental fashion, in [72] a visual vocabulary is proposed where the words are generated using a modified version of agglomerative clustering. Since mobile robots have limited computing resources, an incremental loop closure detection approach for large-scale and long-term operation is proposed in [73]. Most of these frameworks tackle the loop closure detection task through location polling of the distributed votes originating from the local feature descriptors. In [74], regions of high vote density are selected as loop closure candidates via a nearest-neighbouring-descriptor technique. The IBuILD algorithm [75] proposes a binary vocabulary wherein the extracted features are matched across consecutive images. In [76], images with similar visual properties are stored in groups, formulating a hierarchical architecture of places, each of which is represented by a global descriptor. The method selects the candidate loop closing place through a query-to-database comparison of the global descriptors. Subsequently, the most likely match is retrieved through an extensive search in the local feature space. A new approach was recently introduced by the same authors [77], in which dynamic islands are used to group the images based on spatiotemporal similarity. Probabilistic voting schemes utilise the number of aggregated votes in the database to compute a score that indicates previsited locations [18]. In [78], a dynamic sequence segmentation is performed based on the image content proximity, while a clustering technique generates the visual words assigned to these specific places.
Temporal and geometrical checks are also included for the sake of performance improvement in the incrementally constructed vocabulary of tracked words [79]. In a similar manner, a modified version of growing self-organising maps [80] is proposed by [81]. Using Gist features [82] for representing places, the map is incrementally learnt, while the most active neuron is selected as loop closure candidate during query.
Working with sequence-based methods, many robotics scholars evolve the well-known SeqSLAM [23][24][25][26][27]. Querying the traversed trajectory via a Bayes filter, a subset of candidate places is indicated and, through an extended evaluation on the selected sequences, a faster version of SeqSLAM is provided [23]. Likewise, the authors in [24] propose an efficient version, titled 'Fast-SeqSLAM'. In this work, place matching is achieved using a histogram of gradients [83] to describe the downsampled images, along with a k-d tree [84] and a nearest neighbour classifier on the descriptor space. In the work of Wang et al. [25], a real-time framework is presented. The visual sliding window technique, accompanied by odometry information, provides loop closure candidates, while a multiscale search at the selected group of images indicates the best location match. By utilising the BoW model for representing images, SeqSLAM obtains robustness against scale and rotation variations, while the binary vocabulary generated by ORB descriptors [85] offers a low-computational pipeline [26]. Dongdong et al. [27] proposed 'SeqCNNSLAM', wherein pretrained CNN output layers are utilised as image descriptors. Comparisons are performed among CNN feature vectors, and sequence matches are accomplished by following the original version's steps. In [30], a technique is proposed that combines properties from two existing visual place recognition methods which do not depend on learning examples, that is, SeqSLAM and CoHOG [86]. A lightweight system wherein places are compressed and represented by compact codes is proposed in [29]. Finally, Chen et al. [46] used features from all layers of the Overfeat network [87] and integrated them into the spatial scheme of SeqSLAM.
The majority of the aforementioned algorithms in sequence- and appearance-based place recognition rely on SeqSLAM coupled with a pretraining technique for describing the images, or on the addition of extra information along with the incoming visual sensory measurements. In contrast, our previous work [34] focuses on the dynamic segmentation of the traversed path for defining places in order to avoid the sliding window scheme. Although that approach performs favourably against the initial pipeline, submaps generated via feature matching tend to lose their local feature coherence sooner than expected, while the system's complexity remains high since feature extraction is implemented for every image. The proposed framework improves on its predecessor by adopting a point tracking technique among consecutively acquired images to determine places, while keeping its original online behaviour, independent of any training procedure.

| METHODOLOGY
In this section, an extended description of the proposed loop closure detection pipeline is presented. As mentioned previously, the algorithm formulates each place dynamically through the KLT point tracker. To carry out the submap definition, local keypoints are extracted from the camera measurements. Subsequently, the SeqSLAM processing steps are performed, with the data being downsampled and normalised. Each image is compared to the database through SAD, and when a temporal constant is satisfied, the database is searched for a candidate place match. Since the main algorithm follows the initial approach, a brief description of our previous work is provided. An outline of the proposed visual place recognition workflow is shown in Figure 1.

FIGURE 1: An overview of the proposed sequence-based loop closure detection framework. As the incoming visual sensory information (I_{P_1}) arrives at the pipeline, points are extracted via the speeded-up robust features (SURF) detection and description algorithm [88] each time a new place begins its generation procedure. Subsequently, the camera measurement follows SeqSLAM's [21] processing steps, where it is downsampled and normalised before being compared to the previously visited locations in the database. When the following image (I_{P_1++}) enters the system, points are tracked through the Kanade-Lucas-Tomasi (KLT) method [35] to dynamically define places. Finally, when point tracking is lost and the temporal constant is satisfied, the database is queried with the last formulated place.

| SeqSLAM procedure
For each image I entering the system, the visual data is converted into its greyscale equivalent and then downsampled to χ pixels. In addition, the resized images are normalised in an N-sized local neighbourhood, and comparisons with the traversed trajectory are performed by means of SAD:

D_{i,j} = (1 / (R_x R_y)) Σ_{x=0..R_x} Σ_{y=0..R_y} |ρ^i_{x,y} − ρ^j_{x,y}|,    (1)

where R_x and R_y denote the reduced dimensions of the images, while ρ represents each pixel's intensity value. A vector D_i for location i, containing a distance metric against every previsited location j, is generated through SAD, resulting in the comparison matrix D. Focussing on the comparison between sequences of images during the query procedure, a contrast enhancement process is performed on the D_i elements, which is analogous to a 1D patch normalisation in a local area of ε pixels:

D̂_i = (D_i − D̄_ε) / σ_ε,    (2)

where D̄_ε represents the local mean and σ_ε the local standard deviation around element i.
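The two steps above can be sketched in a few lines of NumPy (a minimal illustration with our own function names, not the authors' implementation; the loop-based normalisation is a simple stand-in):

```python
import numpy as np

def sad_distance(img_i, img_j):
    """Mean sum of absolute differences between two downsampled,
    patch-normalised images, as in the SAD measure above."""
    rx, ry = img_i.shape
    return float(np.abs(img_i.astype(float) - img_j.astype(float)).sum()) / (rx * ry)

def enhance_contrast(D, eps):
    """1D patch normalisation of a distance vector D over a local
    neighbourhood of eps elements (local mean and standard deviation)."""
    D = np.asarray(D, dtype=float)
    D_hat = np.empty_like(D)
    for i in range(len(D)):
        lo, hi = max(0, i - eps // 2), min(len(D), i + eps // 2 + 1)
        local = D[lo:hi]
        D_hat[i] = (D[i] - local.mean()) / (local.std() + 1e-9)
    return D_hat
```

Stacking the vectors D_i produced by `sad_distance` column by column yields the comparison matrix D described in the text.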

| Dynamic sequences
In order to define a dynamic group of instances in DOSeqSLAM, local keypoints are detected via the SURF method [88] in each incoming visual measurement. Utilising the full space, projected SURF descriptors (d_I) are temporarily extracted during the online operation for as long as the place construction lasts. Through a feature matching coherence check, new places are determined along the robot's traversed path. Additionally, in cases where the input camera measurement is unable to produce enough visual information, for example, when the system observes a blank plane, the pipeline skips those images, avoiding the construction of inconsistent places.
More specifically, at time t, the incoming image stream is segmented when the correlation between the last n image descriptor sets ceases to exist:

|S_{t−n} ∩ S_{t−n+1} ∩ … ∩ S_t| = 0,    (3)

where |S| denotes the cardinality of set S and S_t is the set of descriptors matched at time t.
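A toy illustration of this segmentation rule, assuming each image has already been reduced to a set of matched descriptor identifiers (the set representation and the function name are our own):

```python
def place_boundaries(descriptor_sets, n=2):
    """Return the indices at which a new place starts: a cut is made at
    image t when the last n descriptor sets share no common feature,
    that is, their intersection is empty."""
    starts = [0]
    for t in range(n - 1, len(descriptor_sets)):
        if not set.intersection(*descriptor_sets[t - n + 1:t + 1]):
            starts.append(t)
    return starts
```

For instance, a stream whose third frame shares no descriptors with its predecessor is cut at that frame, starting a new place there.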

| Sequence-based search
In order to identify a previously visited location in the database, searching is based on place comparisons. During the system's navigation, when the latest sequence Seq_N is created, the database is queried with the first frame of the previously generated place Seq_{N−1}. A number of trajectories are projected on the enhanced distance matrix D̂ for every traversed location j. The trajectory lengths are proportional to the query place size. Each trajectory represents a possible velocity assumption corresponding to different robot velocities V. A score s is calculated for every trajectory assumption by averaging the accumulated values:

s = (1 / seq^Q_Len) Σ_{t = I^Q_1 .. I^Q_end} D̂_{k,t},    (4)

where I^Q_1 and I^Q_end are the first and last image timestamps of the query sequence, respectively, seq^Q_Len is the query length and k denotes the velocity assumption path:

k = j + V (t − I^Q_1),    (5)

where V is designated by multiple values within the range [V_min, V_max] (advancing by V_step each time step t) and L represents the sequence length. The minimum score s is selected for each instance in the navigated path, yielding a score vector S, wherein the lowest value is selected for the particular location I_j. Subsequently, this score is normalised over the second lowest value outside of a window W_DOSeq [34], resulting in γ. Lastly, an average weighted filter is applied for the final decision. A candidate loop closure sequence is determined when factor γ is satisfied, and the system performs an additional greedy image-to-image search into the SAD submatrix for single-image associations.
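The trajectory scoring described above can be sketched as follows; the index rounding, boundary clipping and example velocity range are illustrative assumptions, not the exact implementation:

```python
import numpy as np

def trajectory_score(D_hat, j, q_start, q_len, V):
    """Average enhanced-distance value along a linear trajectory of
    slope V that starts at database image j and spans the query place."""
    total = 0.0
    for step in range(q_len):
        k = int(round(j + V * step))           # database index for this step
        k = min(max(k, 0), D_hat.shape[0] - 1)
        total += D_hat[k, q_start + step]
    return total / q_len

def best_score(D_hat, j, q_start, q_len, v_min=0.8, v_max=1.2, v_step=0.1):
    """Minimum averaged score over all velocity assumptions for start j."""
    velocities = np.arange(v_min, v_max + 1e-9, v_step)
    return min(trajectory_score(D_hat, j, q_start, q_len, V) for V in velocities)
```

Evaluating `best_score` for every candidate start j in the searching area yields the score vector S used for the final place selection.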

| Tracking-DOSeqSLAM
Feature tracking is essential for several high-level computer vision tasks, such as motion estimation [89], structure from motion [90] and image registration [91]. Since the earliest works, feature trackers have been used as a de facto tool for handling points in a video. We have chosen to use a tracker seeded by a floating-point local feature detection and description algorithm during the navigation procedure. Through point tracking, we dynamically segment the incoming visual stream and determine a place. This way, the computationally demanding procedure of feature detection and description for every incoming frame, which is used in our previous work [34], is avoided. Points in I_{P_1++} are browsed within three levels of resolution, around a 31×31 patch, allowing our system to handle large displacements between frames. In such a way, we achieve to generate robust places, even if occlusions occur due to moving objects, as evidenced by the experimental evaluation in Section 4.3. Furthermore, to propose a pipeline with low complexity, we avoid the computation of the bidirectional error between points. In addition, as the algorithm progresses over time, points tend to be gradually lost due to lighting variation or out-of-plane rotation. At time t, when every point's repeatability expires, the previous visual sensory stream I_{(t−n)}, …, I_{(t−3)}, I_{(t−2)}, I_{(t−1)} is determined as a new place:

P = {I_{(t−n)}, …, I_{(t−2)}, I_{(t−1)}}.    (6)

FIGURE 2: In order to define places in the proposed pipeline, local keypoints are extracted via the speeded-up robust features (SURF) [88] detector for the first image of each place (grey circle). Through the Kanade-Lucas-Tomasi (KLT) method [35], points are tracked along the traversed path, while a new place is determined when each tracked point is lost. The query process begins when a temporal window based on a time constant and the query length is satisfied (beige and light orange areas). The latest generated place (light blue area) searches for similar places along the navigated path in a sequence-based scheme. Each visited location (j) is associated with a score (s_j) which indicates the nearest neighbouring trajectory assumption (yellow dashed line). The selected images (green box in the S vector) point out the proper place, and an image-to-image search is subsequently performed.

Algorithm 1 Place definition
Finally, two important components are retained during navigation: (i) the place index P and (ii) its length L_P. Algorithm 1 summarises this process.
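A simplified sketch of this place-definition loop; here the per-frame number of surviving KLT tracks is taken as given (in the actual pipeline it comes from tracking SURF points with the KLT method), so the helper below only illustrates the bookkeeping of Algorithm 1:

```python
def define_places(tracked_counts):
    """Close a place when every point initialised at its first frame has
    been lost by the tracker (the surviving-track count reaches zero);
    a new place, with freshly detected points, starts at that frame."""
    places, start = [], 0
    for t, alive in enumerate(tracked_counts):
        if alive == 0:                      # tracking has fully expired
            places.append((start, t - 1))   # place spans frames start..t-1
            start = t                       # re-detection begins a new place
    return places
```

Each returned pair is a place's frame range, from which the place index P and its length L_P follow directly.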

| Image modulation
Next, the pipeline follows the process described in Section 3.1.1 in order to retain the initial SeqSLAM characteristics. Incoming frames are scaled down to χ pixels and then normalised. Comparisons between the query, that is, the current robot view, and the traversed locations are achieved via SAD, yielding the distance matrix D. However, the contrast enhancement step is omitted in the proposed pipeline, since it is mainly essential when the system confronts changing environments [66].
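A minimal stand-in for this modulation step (the nearest-neighbour block resize and the 32 × 64 target shape are illustrative choices, not the authors' exact parameters):

```python
import numpy as np

def preprocess(frame, chi=(32, 64), n=8):
    """Downsample a greyscale frame to chi pixels and normalise it in
    n x n local neighbourhoods, mirroring SeqSLAM's front end."""
    h, w = frame.shape
    ys = np.arange(chi[0]) * h // chi[0]
    xs = np.arange(chi[1]) * w // chi[1]
    small = frame[np.ix_(ys, xs)].astype(float)   # crude nearest-neighbour resize
    out = np.empty_like(small)
    for y in range(0, chi[0], n):
        for x in range(0, chi[1], n):
            patch = small[y:y + n, x:x + n]
            out[y:y + n, x:x + n] = (patch - patch.mean()) / (patch.std() + 1e-9)
    return out
```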

| Place-to-place association
When a place is determined, the query procedure starts. Aiming at reliable searching for similar submaps, the newly generated place P_Q should not share any common semantic information with the recently visited locations. This is due to the fact that a set of input frames obtained during a short time interval before I_t are expected to be similar without corresponding to actual loop closure events. To prevent our pipeline from detecting such cases, we consider a temporal window t_W, which rejects locations visited just earlier (I_{Q_1}, …, I_{Q_1 − t_W}). We define this window based on a temporal constant ψ and the place length L_P. This way, the searching area spans between the first perceived location I_1 and the one determined by the temporal window, I_{Q_1 − t_W}, as depicted in Figure 2 by the red dashed line. The latest produced submap searches the navigated path for similar places via a sequence-based technique. For each database location I_j belonging to the searching area, a difference score s is calculated (Equation (4)) for each velocity assumption (Equation (5)). These scores are based on the values the trajectory line passes through in travelling from I_{Q_1} to I_{Q_end} (Figure 2). The trajectory with the minimum s value is selected as the representing score s_j between the query place and the one starting from frame I_j. When all database images have been examined, a score vector S = {s_1, s_2, …, s_{I_{Q_1} − t_W}} is determined, and subsequently the minimum value is selected, corresponding to the start location I_id of the candidate submap. Next, following the nearest-neighbour distance ratio [32], the chosen score is normalised over the second lowest score (Figure 2) outside of a window range of equal size to the place length L. The normalised score, which is the ratio between these scores, is calculated for each place, and one of the following conditions has to be satisfied before a submap is recognised as previsited.
The recent score has to be lower than a threshold σ (σ < 0.7 [21]), or the scores generated by the last two consecutive submaps have to satisfy a threshold λ. This temporal consistency check is incorporated in the proposed pipeline since loop closure detection is a task subject to the temporal order of the visited places along the navigation route. That is, if a place is identified as previsited, then it is highly probable that the following ones have also been traversed. This way, we improve the system's performance, while avoiding the loss of actual loop detections due to strict thresholding. Algorithm 2 illustrates this process.
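The decision rule can be sketched as below. The reading of the temporal-consistency branch (both of the last two ratios satisfying λ) and the value of λ are our assumptions:

```python
def accept_match(scores, best_idx, L, sigma=0.7, prev_ratio=None, lam=0.75):
    """Normalise the lowest place score by the second-lowest score found
    outside a window of length L around it (nearest-neighbour distance
    ratio), then accept if the ratio beats sigma, or if the last two
    consecutive places' ratios both satisfy the looser threshold lam."""
    outside = [s for i, s in enumerate(scores) if abs(i - best_idx) > L]
    if not outside:
        return None
    ratio = scores[best_idx] / min(outside)
    accepted = ratio < sigma or (prev_ratio is not None
                                 and ratio < lam and prev_ratio < lam)
    return ratio, accepted
```

The returned ratio is carried over as `prev_ratio` when the next place is queried, enabling the temporal-consistency branch.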

| Local best match
Up to this point, the proposed algorithm is capable of identifying a previsited place in the navigated map. Finally, an image-to-image correlation is performed between the query locations and the most similar members of the selected submap in the database. Hence, each place member is associated with the most similar of the corresponding database images through the SAD submatrix. Let us consider that, at time t, the system correctly indicates a previously visited place by the matching pair ⟨I_{P_Q_1}, I_id⟩. Our method defines a group of images which are the only set of database entries that are going to be evaluated through the SAD metric. In this paper, we determine this group to be of double the size of the camera frequency κ, while it is centred around I_id for I_{P_Q_1}, that is, I_{(id − 1) − κ}, …, I_{(id − 1) + κ}. However, for the following image in the query place, I_{P_Q_2}, this area shifts by 1.
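The shifting search window can be expressed as a small helper; the exact index arithmetic (the offset by one per query frame and the clamping at zero) is an illustrative assumption:

```python
def search_window(match_id, kappa, offset):
    """Database index range examined for the (offset + 1)-th query frame:
    a window of twice the camera frequency kappa, centred on the matched
    start image and shifted by one for each subsequent query frame."""
    centre = match_id + offset
    lo = max(0, centre - kappa)
    return lo, centre + kappa
```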

| EXPERIMENTAL SETUP
This section provides a description of the experimental procedure, an extensive evaluation of the proposed pipeline, as well as comparative results. A total of seven publicly available datasets are selected for assessing our method. The present approach is compared in terms of precision-recall metrics [38] with our previous work, the baseline version of SeqSLAM, as well as other well-known place recognition solutions. All experiments were performed on an Intel i7-6700HQ 2.6 GHz processor with 8 GB of RAM.

| Datasets
The chosen environments represent outdoor, static and dynamic areas containing mostly urban views. In addition, a variety of different measurement properties are selected, for example, robot velocity, image resolution and frame rate, in order to examine the system's adaptation to different conditions. In Table 1, a summary of each data sequence used is provided. Three out of seven datasets belong to the KITTI visual collection [92], representing urban environments that mostly consist of houses, cars and trees. The perceived visual sensory information is obtained by a stereo camera system mounted on a car, while the recorded data offer considerable loop closure events, in addition to accurate odometry and high resolution images. Furthermore, image sequences 00 and 02 are selected in order to examine the algorithm in long-term operations, since the vehicle traverses a distance of over 11 km. Lip6 Outdoor [71] provides information perceived via a handheld camera encountering mostly buildings, while a high amount of loop closures along the navigated path is presented. This particular dataset is chosen to test the robustness of the system since it includes variations in the orientation and velocity of the incoming image stream, as well as low camera resolution and frame rate. City Centre [19] and New College [93] have been registered by the vision system of a robotic platform. They refer to significantly different operational conditions (e.g. travelled distance, frame size, acquisition frequency, camera orientation), as presented in Table 1. However, they both contain a significant amount of loop closure examples. Note that the acquisition frequency of New College was resampled to one frame per second, from its initial 20 Hz rate, due to the robot's low velocity and high camera frequency. In Malaga 2009 [94], the Parking 6L (Malaga 6L) data sequence was selected.
This environment mostly contains cars and trees, while the camera information is provided by means of a vision system mounted on an electric buggy-typed vehicle. Plenty examples of revisited locations are

FIGURE 3 Precision-recall curves of the proposed pipeline evaluating the utilised number of extracted speeded-up robust features (SURF) [88] ξ against the previous approach [34] and the baseline solution of SeqSLAM [21]. Experiments are performed on the KITTI 00, 02 and 05 data sequences [92], Lip6 Outdoor [71], City Centre [19], New College [93] and Malaga 6L [94]. As the number of detected points increases, our proposed system presents a slight improvement, reaching recall values of about 77% in the case of KITTI 00, 85% in KITTI 02 and 56% in KITTI 05. In Lip6 Outdoor, a score of 50% is achieved, while a similar performance is observed for the rest of the datasets. However, the performance falls drastically when feature extraction exceeds 500 points, as evidenced in KITTI 05 and New College. This is mainly owed to the resulting size of the generated submaps, which fail to be matched with the ones in the traversed trajectory. The proposed system also offers higher performance than its predecessors for the rest of the evaluated datasets.

The incoming visual stream in most sequences is provided by a stereo camera rig; however, since our approach aims at an appearance-based pipeline, only the monocular capture was used. For City Centre, New College and Malaga 6L, the right visual stream was selected, while for the KITTI sequences the left one was used.

| Evaluation protocol
In this section, the evaluation protocol for the proposed framework is presented in detail. Precision-recall metrics along with the ground truth (GT) information are utilised in order to assess the algorithm's performance. Comparisons were performed based on the parameters in Table 2. Those values remain constant for every tested environment, so as to demonstrate the adaptability of the algorithm. It is notable that the proposed approach is able to achieve higher recall rates at 100% precision than any of its predecessors on most of the evaluated datasets.

| Parameter discussion
In this section, we briefly discuss the system's chosen parameters. In general, most of the proposed values, for example, the downsampled image size χ and the reduced image size R_x, R_y, are defined similarly to the initial version of SeqSLAM [21]. The velocity properties [V_max, V_min, V_step] come from the open-source implementation of OpenSeqSLAM [22], while the normalisation parameter N is defined based on the OpenSeqSLAM2.0 MATLAB toolbox [95]. The number of extracted SURF points ξ is defined via the precision-recall metrics in Figure 3, with the aim of achieving a framework exhibiting high performance.
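As a minimal sketch, the parameter set discussed above might be grouped as follows; the field names mirror the symbols in the text, but the concrete values are placeholders (except ξ = 500, which is selected later in the evaluation), not the ones from Table 2.

```python
from dataclasses import dataclass


@dataclass
class SeqSLAMParams:
    """Hypothetical grouping of the SeqSLAM-style parameters; values are
    illustrative placeholders, not those reported in Table 2."""
    chi: int = 64          # downsampled image size (chi), placeholder
    r_x: int = 8           # reduced image width R_x, placeholder
    r_y: int = 8           # reduced image height R_y, placeholder
    v_min: float = 0.8     # minimum trajectory velocity V_min, placeholder
    v_max: float = 1.2     # maximum trajectory velocity V_max, placeholder
    v_step: float = 0.1    # velocity sampling step V_step, placeholder
    n_norm: int = 10       # local normalisation parameter N, placeholder
    xi: int = 500          # extracted SURF points (value selected in the text)


params = SeqSLAMParams()
```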

| Ground truth
The ground truth is defined as a binary matrix whose rows and columns correspond to different timestamps, indicating the actual loop closure events occurring in a dataset. An element GT_ij = 1 denotes the existence of a loop, and GT_ij = 0 otherwise. For the KITTI 00, 02, 05 and New College data sequences, the GT was manually generated in [79] through odometry information. In Lip6 Outdoor, this information is provided by the authors in [71]. Similarly, City Centre contains its own GT, while Malaga 6L was manually labelled by the authors in [52].
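The binary GT matrix described above can be sketched as follows; the sequence length and the loop pairs are invented purely for illustration.

```python
# Toy ground-truth matrix for a 6-frame sequence: GT[i][j] = 1 means that
# query frame i closes a loop with database frame j. The loop pairs here
# are hypothetical, not taken from any of the evaluated datasets.
n_frames = 6
GT = [[0] * n_frames for _ in range(n_frames)]
GT[4][1] = 1   # frame 4 revisits the location of frame 1
GT[5][2] = 1   # frame 5 revisits the location of frame 2


def is_loop(gt, i, j):
    """Return True when the ground truth marks (i, j) as a loop closure event."""
    return gt[i][j] == 1
```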

| Precision-recall metric
A true-positive detection concerns a correct match as indicated by the GT. Any recognition occurring within a small radius of the query location is considered a correct match. On the contrary, any identification occurring outside of this area is defined as a false-positive detection, while false-negative detections are the ones that the loop closure detection system ought to have identified but failed to. The tolerance used for the evaluation is 40 m. Thus, precision is the ratio of true positives over the total number of the system's detections:

Precision = True positive / (True positive + False positive), (8)

whereas recall is defined as the number of true positives over the sum of loop closure events contained in the GT:

Recall = True positive / (True positive + False negative). (9)

TABLE 3 Recall rates at 100% precision: a comparison of the proposed method against our previous work [34], as well as the baseline approach of SeqSLAM [21]. Bold values indicate the maximum performance per evaluated image sequence. As shown by the obtained results, the proposed pipeline outperforms the previous versions, while a performance improvement is observed as the extracted set of points increases, up to a certain point. Aiming for an efficient system which preserves high recall scores at 100% precision, the case of 500 points is indicated.
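The definitions in Equations (8) and (9) can be sketched as follows; note that, for simplicity, this sketch treats the GT as exact (query, match) pairs and omits the 40 m radius check used in the actual evaluation.

```python
def precision_recall(detections, gt_pairs):
    """Compute precision (Equation 8) and recall (Equation 9) from a set of
    detected (query, match) pairs against the ground-truth pairs."""
    detections, gt_pairs = set(detections), set(gt_pairs)
    tp = len(detections & gt_pairs)   # correct matches confirmed by the GT
    fp = len(detections - gt_pairs)   # detections outside the GT
    fn = len(gt_pairs - detections)   # loops the system failed to identify
    precision = tp / (tp + fp) if detections else 1.0
    recall = tp / (tp + fn) if gt_pairs else 1.0
    return precision, recall
```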

| Performance evaluation
By altering the loop closure decision parameter λ, precision-recall curves are monitored for different cases of image keypoint detection (ξ = 100, 300, 500, 700) in Figure 3. The system's performance with the proposed dynamic place generation is evaluated and compared with the previous version of DOSeqSLAM, as well as the baseline approach of SeqSLAM. The latter is based on the open-source implementation of OpenSeqSLAM, configured through the OpenSeqSLAM2.0 toolbox [95]. The selected parameters remained constant over all datasets. However, aiming at a fair performance evaluation, the contrast enhancement step was omitted for both previous methods, SeqSLAM and DOSeqSLAM. Furthermore, a 40 s temporal window, similar to the proposed method's, was applied to reject recently visited locations. For an easier reading of the curves, the best results at 100% precision are also presented in Table 3. Our first remark is that the area under the curve of Tracking-DOSeqSLAM is higher than that of the corresponding curves of its predecessors, outperforming them in most of the evaluated datasets. As can be observed, DOSeqSLAM is usually able to obtain similar recall at perfect precision to SeqSLAM, except for New College, where the result drops to a rate of 17%. According to our experiments, the proposed pipeline shows especially high performance for Lip6 Outdoor, City Centre and Malaga 6L, for each case of the extracted keypoints, compared to the other solutions. Furthermore, the maximum scores for the other datasets are also high, while a notable improvement is observed in KITTI 02 for 300 keypoints, reaching a score of about 85% at perfect precision.
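The curve generation described above, sweeping the decision parameter λ over per-query similarity scores, can be sketched as follows; the scores, labels and thresholds are hypothetical, not taken from the experiments.

```python
def pr_curve(scores, labels, thresholds):
    """Sweep the loop closure decision threshold (lambda) over per-query
    similarity scores: a query is reported as a loop when its score reaches
    the threshold; `labels` says whether the query is a true loop."""
    curve = []
    for lam in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= lam and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= lam and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < lam and y)
        prec = tp / (tp + fp) if tp + fp else 1.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        curve.append((lam, prec, rec))
    return curve
```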

FIGURE 4 Submaps generated from the proposed dynamic segmentation of the incoming image stream using the parameters defined in Table 2.

FIGURE 6 A set of speeded-up robust features (SURF) [88] is detected in the first location of a newly formulated place (image 3547) and subsequently tracked along the trajectory. At time t (image 3583), the incoming visual sensory stream I(t−n), …, I(t−1), I(t) is finalised as a new submap, since all the initial points have ceased to exist in the tracker.

Nevertheless, in contrast to most data sequences, where increased keypoint extraction improves the performance, the evaluation of the proposed method in KITTI 05 and New College shows an abrupt drop in the recall rate. In the latter case, we observe a lower recall score, while in the former our method does not recognise any previsited location. This is owed to the fact that places generated under those conditions fail to be matched with the query ones due to their extreme size. Considering the results presented in Table 3, the parameter ξ is set to 500 in order to ensure a system that achieves high recall scores at 100% precision. Figure 4 shows the submaps formulated by Tracking-DOSeqSLAM for each dataset, while Figure 5 presents the detected loops at 100% precision. For each submap, a random colour has been assigned to highlight a distinct place across the traversed trajectory, and thus every location associated with the same submap is labelled by the same colour. An example containing images from the same place, defined by our algorithm based on point tracking, is illustrated in Figure 6. Evidently, as soon as the robot turns onto a route that represents a visually consistent area, the corresponding images that exhibit temporal and content proximity are aggregated into the same group (place). Finally, in Figure 7, some accurately detected locations using the selected parameterisation are shown.

FIGURE 7 Some example images that are correctly recognised by our pipeline as loop closure events. The query frame is the image recorded by the vehicle at time t, whereas the matched frame is the corresponding one identified among the members of the chosen place. From left to right: Lip6 Outdoor [71], New College [93], City Centre [19] and KITTI 02 [92].

TABLE 4 Processing time per image (ms/query) of Tracking-DOSeqSLAM, as well as of its previous version [34] and the baseline approach [21], for the KITTI 00 data sequence. It is notable that the proposed pipeline requires less time due to its efficient matching process, which is based on the image aggregation of the generated places.
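The place-generation rule described above, detecting points in the first frame of a new place and closing the submap once none of them survives the tracker, can be sketched generically as follows. This is not the authors' implementation: the `detect` and `track` callables are abstractions standing in for the SURF detector and the KLT tracker (e.g. OpenCV's `cv2.calcOpticalFlowPyrLK`).

```python
def segment_into_places(frames, detect, track):
    """Group consecutive frames into places (submaps). Points are detected
    in the first frame of a new place and tracked forward; when none of the
    initial points survives, the current place is closed and a new one
    begins. `detect(frame)` returns a list of points; `track(points, prev,
    cur)` returns the subset of points still tracked in `cur`."""
    places, current, points = [], [], []
    prev = None
    for frame in frames:
        if not points:             # all initial points lost: start a new place
            if current:
                places.append(current)
            current = [frame]
            points = detect(frame)
        else:                      # keep tracking the initial point set
            points = track(points, prev, frame)
            current.append(frame)
        prev = frame
    if current:
        places.append(current)
    return places
```

With a toy detector returning two points and a tracker losing one point per frame, a six-frame stream splits into two three-frame places.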

| System's response
To analyse the computational complexity of the proposed method, we ran each framework, that is, SeqSLAM, DOSeqSLAM and Tracking-DOSeqSLAM, on the KITTI 00 image sequence, which is the longest among the evaluated ones and exhibits a remarkable amount of loop closures. In Table 4, the corresponding processing times per query image are reported.

| Comparative results
This section compares the proposed pipeline with other well-known state-of-the-art algorithms. Firstly, since Tracking-DOSeqSLAM is an evolution of our previous work [34], as well as of SeqSLAM [21], we present in Table 5 the final number of generated places and the average processing time for each method. In this regard, we aim to show that the proposed modifications result in an improvement in terms of processing time and computational complexity. As the proposed method follows the baseline approach regarding the main processing steps (e.g. image downsampling, comparison technique), the computational complexity mainly depends on the number of constructed places. As highlighted in Table 5, our system generates a number of places at least one order of magnitude smaller than SeqSLAM, while a significant decrease is also observed against our previous version. This results in notably fast associations between similar submaps, permitting our method to process each query in less time than the previous versions, while presenting high recall scores at perfect precision. Furthermore, the impact in terms of recall is high, outperforming its predecessors in most of the tested datasets. In addition, for the sake of completeness, we show the results of other modern methods, with the aim of helping the reader to place the proposed pipeline within the state-of-the-art. In Table 6, we compare our approach with well-known works in place recognition, namely FAB-MAP [19], DBoW2 [40], BoW-SeqSLAM [26], FILD [52] and Kazmi and Mertsching [81]. The maximum recall scores at perfect precision for each approach are based on the figures reported in the original papers. The term N/A denotes that the corresponding information is not available from any cited source.
Furthermore, for the cases of FAB-MAP 2.0 [38], DBoW2 [40] and FILD [52], where no actual measurements are provided for the used datasets, the presented results are obtained from the setups described in [55,81], respectively. Most of the approaches (e.g. FAB-MAP, DBoW2, BoW-SeqSLAM) are based on pretrained visual vocabularies, while FILD uses deep features in order to represent the incoming sensory measurements. Although the proposed system achieves high recall rates in every tested dataset, its difficulty in reaching higher scores against recent loop closure pipelines, which utilise more sophisticated image processing techniques for the location representation, is evident. This is owed to the inability of the sum of absolute differences (SAD) to fully quantify the obtained frames' visual properties. However, our key purpose is to demonstrate the performance gain achieved over the original SeqSLAM versions through a refined trajectory segmentation, while operating with the lowest possible complexity and avoiding any training procedure. Thus, a direct comparison of Tracking-DOSeqSLAM with the rest of the approaches is not informative; it is only included here as a performance indicator to better interpret the possible improvement margins. In support thereof, in Table 7, we compare the average execution time of the proposed framework with the baselines on three representative datasets. The time for each approach is based on the reported values presented in the aforementioned sources. It is noteworthy that the proposed pipeline achieves the lowest timings in every tested dataset.
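The SAD comparison mentioned above can be sketched as follows; this is a minimal sketch assuming equally sized, downsampled grayscale images held as NumPy arrays, with the SeqSLAM-style patch normalisation assumed to have been applied beforehand.

```python
import numpy as np


def sad_distance(img_a, img_b):
    """Sum of absolute differences (SAD) between two equally sized,
    downsampled grayscale images, normalised by the pixel count; lower
    values indicate more similar frames."""
    a = img_a.astype(np.float32)
    b = img_b.astype(np.float32)
    return float(np.abs(a - b).sum()) / a.size
```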
In the KITTI data sequences, the proposed algorithm performs unfavourably against the other solutions. However, despite FILD achieving the highest recall rates, this method is computationally intensive, since a graphics processing unit was used to extract deep features, making it unsuitable for mobile robotic platforms. Moreover, SURF points are used for verifying candidate pairs through the RANSAC technique, which is well known for its high complexity and ability to reject outliers. In a similar manner, BoW-SeqSLAM and Kazmi and Mertsching exploit the epipolar geometry between the chosen images to further enhance the system's performance. On Lip6 Outdoor, the proposed pipeline exhibits over 40% recall, outperforming the other methods. In the cases of City Centre, New College and Malaga 6L, our algorithm's performance drops, yet it retains better recall scores than its predecessors, while keeping the lowest complexity.

| CONCLUSIONS
The article at hand extends our previous work [34] by presenting an appearance- and sequence-based loop closure detection method, dubbed Tracking-DOSeqSLAM, which makes use of KLT tracking in order to efficiently fragment the robot's map into submaps defining dynamic places. Following its ancestor's image representation and similarity comparison processes, the proposed pipeline highlights the system's ability to recognise previsited places using almost two orders of magnitude fewer operations. In this way, an efficient framework for autonomous robots with restricted computational resources is achieved. When the proper place is selected, an image-to-image search in the SAD domain determines the appropriate location. The system retains its ability to perform robustly under different operational conditions and works online without any training procedure. Compared with the initial version, the proposed approach achieves high recall rates at perfect precision in most of the tested publicly available datasets, while still retaining real-time performance.