Real-time long-term tracking with reliability assessment and object recovery

In recent years, many visual tracking algorithms based on discriminative correlation filters have been proposed and proved to be successful in short-term tracking. However, most algorithms do not handle long-term tracking well due to factors such as occlusion and deformation. The aim of this paper is to propose a long-term tracking method with reliability assessment and object recovery. First, the relationship between the overlap rate and the response value is extensively studied, and then from the perspective of the time axis, the tracking process is evaluated by the fluctuation trend of the continuous response value. In the object recovery mechanism, we propose to alternately use local search and global search to improve the efficiency of detection. To this end, a sliding window is designed for cyclic shifting in the local region to achieve dense sampling within the region of interest, and EdgeBox is used in the global search to achieve target detection. Further, flexible switching between local search and global search is achieved by the difference in displacement of the object. Extensive experimental results on several benchmark datasets demonstrate that the proposed long-term tracker can achieve state-of-the-art accuracy with real-time speed of about 27 frames per second.


INTRODUCTION
Visual object tracking has been applied to many fields, such as human-computer interaction, security and surveillance, and auto-control systems. Existing object tracking algorithms have made great achievements, but due to factors such as illumination variation, occlusion, deformation, scale variation and background clutter, there are still many problems to be further investigated.
According to the appearance model, visual object tracking approaches can be broadly classified into two categories, namely generative and discriminative methods. The generative methods [1][2][3][4][5][6][7][8][9] cast the tracking problem as a nearest-neighbor search for the target model. Their shortcoming is that the background information around the target is not considered, so the image information is not fully exploited. The discriminative methods [10][11][12][13][14][15][16][17][18][19] formulate target tracking as a binary classification problem. The extracted target and background information are used to train the classifier and then separate the target from the background. Since the background information is additionally utilized, the discriminative methods are more advantageous than the generative methods in practical applications.
This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited. © 2021 The Authors. IET Image Processing published by John Wiley & Sons Ltd on behalf of The Institution of Engineering and Technology
Recently, correlation filter based trackers have attracted considerable interest due to their real-time speed and state-of-the-art accuracy [11,17,18,20,21]. In general, these trackers learn a correlation filter online to localize the object in consecutive frames. However, correlation filter based trackers still face two serious problems. One major issue is the boundary effect caused by the circulant assumption. The other is that the target search region only contains a small local neighbourhood to limit drift. Therefore, the trackers easily drift in cases of fast motion, occlusion or background clutter. It is important to note that occlusion can have a serious adverse effect on the performance of trackers. When occlusion occurs, the similarity of the desired target region degrades and the cluttered background distracts the template into matching other, mistaken regions. To make matters worse, the template is then updated with the false information, which passes unsatisfactory templates on to the subsequent frames. Therefore, how to overcome the performance degradation caused by occlusion is a key concern when designing a long-term tracker.
The aim of this paper is to develop a long-term tracker based on correlation filters to alleviate the problem of tracking failure caused by factors such as deformation and occlusion. The main difficulties in the overall framework of the tracker lie in the process evaluation and object recovery. To this end, we designed a criterion to achieve an accurate assessment of the reliability of the tracking process, and also established a complete object recovery mechanism. The main contributions of this paper are as follows:
• Different from the methods using peak to sidelobe ratio (PSR) [20], median flow (MF) [59] and average peak-to-correlation energy (APCE) [49], we regard the reliability of the tracking results as correlated back and forth on the time axis, and design a new tracking result evaluation criterion based on continuous response values.
• We propose a new object recovery mechanism in which the sliding window performs a local search and EdgeBox [61] performs a global search. The recovery mechanism can adaptively select an appropriate search mode to find the missing object based on the average displacement of the object.
• The proposed algorithm is validated on several widely used large-scale benchmark datasets [22,24-26] and the experimental results show that our long-term tracker is able to achieve state-of-the-art accuracy and real-time speed.
The remainder of this paper is organized as follows. Previous related works are reviewed briefly in Section 2. Section 3 presents our approach for robust object tracking, which includes the short-term tracker based on correlation filtering, the reliability assessment of the tracking process, the object recovery mechanism and the model update strategy. Section 4 summarizes the overall workflow of the algorithm and illustrates the performance of the proposed tracker. Finally, the conclusions and future research directions are discussed in Section 5.

Tracking-by-detection methods
The detection-based tracking framework has attracted widespread attention because classifiers suitable for online learning, such as support vector machines (SVM) [27], Random Forest classifiers [28] or boosting variants [29,30], can be used to achieve real-time tracking. For each frame, a set of positive and negative training samples is collected for incrementally learning a discriminative classifier to separate a target from its background. However, extracting samples to learn online classifiers can cause sample ambiguity problems, while slight inaccuracies in labelling samples can affect the classifier and gradually lead to tracker drift. In order to alleviate the problem of model updating caused by sample ambiguity, considerable efforts have been made in previous studies, such as multiple instance learning (MIL) [29], Struck [31], semi-supervised learning [30,32], and P-N learning [33]. In addition, a robust appearance model containing both holistic templates and local representations was proposed in [34] to handle drastic appearance changes. Similarly, a learning framework with a 1-class SVM and a structured output SVM was designed in [35] so that a descriptive component and a discriminative component collaborate in tracking. Instead of learning only one single classifier, multiple classifiers with different learning rates were constructed in [36] to address the model drift problem. By alleviating the problem of sample ambiguity, the above methods have achieved considerable performance on the recent benchmark dataset [22]. Wang et al. [23] proposed a three-step procedure to convert a general object detector into a tracker. They considered the tracking problem as a special type of object detection problem, which they called instance detection.
With proper initialization, a detector can be quickly converted into a tracker by learning the new instance from a single image.

Correlation filter-based methods
High computational efficiency and state-of-the-art accuracy are two significant advantages of the correlation filter based tracker. By transferring the model into the Fourier domain, the matrix algebra can be solved by element-wise operation.
In [20], Bolme et al. proposed the minimum output sum of squared error (MOSSE) method, which learned the filter on gray-scale images and used intensity features to represent the object. Henriques et al. [17] proposed the circulant structure of tracking by detection with kernel (CSK) algorithm, which generated correlation filters by exploiting the circulant structure of training samples and transferred filters into the Fourier domain. In their work, the CSK achieved the highest tracking speed with considerable performance on the benchmark dataset. As an extension to CSK, the histogram of oriented gradients (HOG) features, Gaussian kernels, and ridge regression were used in the kernelized correlation filter (KCF) [18]. The discriminative scale space correlation filter tracker (DSST) [13] learned discriminative correlation filters with a scale pyramid representation to handle the scale change of target objects. It learned two kinds of filters, one for translation estimation and the other for scale estimation. Besides, the scale adaptive with multiple features tracker (SAMF) [14] estimated the scale variation by training the KCF to search for the target around the latest estimated position on various sizes of the image patch. Huang et al. [37] proposed the aberrance repressed correlation filter. By introducing a regularization term to restrict the response map, their method was capable of suppressing aberrances caused by both background noise and appearance changes of the tracked objects.
Hand-crafted features have been widely used in correlation filter based tracking, and deep features have recently begun to be combined with correlation filtering to exploit their enormous potential. Following the end-to-end philosophy, the recent work [38] integrated DCF into the Siamese framework [39], and [40] employed DCF as a one-layer convolutional neural network (CNN) for end-to-end training. Some other DCF methods [41][42][43] focused on integrating convolutional features from a fixed pre-trained deep network. Ma et al. [41] proposed a hierarchical ensemble method of independent DCF trackers to combine multiple convolutional layers. Danelljan et al. [42] proposed the continuous convolution operator tracker (C-COT) to efficiently integrate multi-resolution shallow and deep feature maps. The efficient convolution operators (ECO) tracker [43] further optimized the speed and accuracy of the C-COT tracker at three levels: model size, training set size, and model update. Due to the high complexity of CNN features, trackers with deep features are not superior in speed to traditional hand-crafted feature trackers. Sun et al. [44] proposed the region-of-interest (ROI) pooled correlation filters for visual tracking. They showed that the ROI-based pooling can be equivalently achieved by enforcing additional constraints on the learned filter weights, which made the ROI-based pooling feasible on the virtual circular samples.

Long-term tracking methods
Correlation filters with short-term memory have significant limitations in dealing with object occlusion and disappearance. Therefore, many methods aim to equip correlation filters with long-term memory to achieve more advanced tracking. The combination of a short-term tracker and a detector is a commonly used long-term tracker architecture, which was first used in tracking-learning-detection (TLD) [33]. The seminal work of [33] proposes a memory-less flock of flows as a short-term tracker and a template-based detector. Another paradigm was pioneered in [45], where the authors cast localization as local keypoint descriptor matching with a weak geometrical model. Ma et al. [46,47] proposed using KCF as a short-term tracker and a random fern classifier as a detector to construct a long-term tracker. Similarly, Hong et al. [48] combined a KCF tracker with a SIFT-based detector which was also used to detect occlusions. Dai et al. [49] established a simplified long-term tracker by employing assistant re-detection, which combined the short-term components with the SVM classifier in constructing a long-term component. In [4], the target model was decomposed into a grid of cells and an occlusion classifier was learned for each cell. In [50], a Bayes filter was developed for target loss detection and re-detection for multi-target tracking. Alan et al. [51] utilized the novel DCF constrained filter learning method to design a detector that was able to re-detect the target in the whole image efficiently. Moreover, they proposed to use the correlation response value and PSR to evaluate tracking failure. Sun et al. [52] treated the filter as the element-wise product of a base filter and a reliability term to mitigate model degradation. Dai et al. [53] proposed an adaptive spatial constraint correlation filtering algorithm to optimize both filter weights and spatial constraint matrices. In addition, Goutam et al. [54] leveraged the complementary properties of deep and shallow features to improve both robustness and accuracy. Their work has played a significant role in unlocking the potential of deep features for tracking. Recently, some methods have tried to combine deep learning-based proposal networks with correlation filter tracking. For example, Yang et al. [55] proposed a novel long-term correlation filtering tracking method that applied a DCF-based tracking model with a novel target-aware detector in a collaborative way. In their method, they evaluated the reliability of the tracking results based on the enhanced PSR, and used region proposal network (RPN) based detectors to perform object detection under tracking failure. Yan et al. [56] proposed a novel robust and real-time long-term tracking framework based on the proposed skimming and perusal modules. The perusal module aims to precisely locate the tracked object in a local search region using the offline-trained regression and verification networks. The skimming module focuses on efficiently selecting the most likely regions from densely sampled sliding windows.

The basic correlation filter tracking component
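The classifier of the basic correlation filter tracker is trained by minimizing a regularized squared error:

$$\min_{w} \sum_{i} \bigl(f(x_i) - y_i\bigr)^2 + \lambda \lVert w \rVert^2 \tag{1}$$

where $f(x) = w^{T}x$ is used to minimize the squared error over samples $x_i$ and their regression targets $y_i$, and $\lambda$ is the regularization parameter. In the case of nonlinear regression, the kernel trick, $f(z) = w^{T}z = \sum_{i=1}^{n} \alpha_i \kappa(z, x_i)$, is applied to allow a more powerful classifier. For the most commonly used kernel functions, the circulant matrix trick can also be used [18]. The dual space coefficients can be learnt as below:

$$\hat{\alpha}^{*} = \frac{\hat{y}}{\hat{k}^{xx} + \lambda} \tag{2}$$

where $k^{xx}$ is defined as the kernel correlation in [18] and the hat denotes the Fourier transform. In this paper, we adopt the Gaussian kernel, which can be applied with the circulant matrix trick as below:

$$k^{xx'} = \exp\left(-\frac{1}{\sigma^{2}}\left(\lVert x \rVert^{2} + \lVert x' \rVert^{2} - 2\mathcal{F}^{-1}\bigl(\hat{x}^{*} \odot \hat{x}'\bigr)\right)\right) \tag{3}$$

where $\mathcal{F}^{-1}$ denotes the inverse FFT. The patch $z$ at the same location in the next frame is treated as the base sample to compute the response in the Fourier domain,

$$\hat{f}(z) = \hat{k}^{\tilde{x}z} \odot \hat{\alpha} \tag{4}$$

where $\tilde{x}$ denotes the data to be learnt in the model. When we transform $\hat{f}(z)$ back into the spatial domain, the translation with respect to the maximum response is considered as the movement of the tracked target. For the scale variation of the objects, we apply the scale pool method [14]. The template size is fixed as $s_T = (s_x, s_y)$, and the scaling pool is defined as $S = \{t_1, t_2, \ldots, t_k\}$. Suppose that the target window size is $s_t$ in the original image space. The samples can be resized into the fixed template size $s_T$ by using bilinear interpolation, and the final response is calculated by

$$\arg\max_{t_i \in S} \mathcal{F}^{-1}\hat{f}(z^{t_i}) \tag{5}$$

where $z^{t_i}$ is the sample patch with the size of $t_i s_t$, which is resized to $s_T$. To adapt to changes in the appearance of objects, we linearly combine the new filter with the old one as below:

$$T \leftarrow \theta\, T_{new} + (1 - \theta)\, T \tag{6}$$

where $T = [\alpha^{T}, \tilde{x}^{T}]^{T}$ is the template to be updated, $T_{new}$ is the template computed from the current frame and $\theta$ is the model update coefficient. It is worth noting that $T$ is very useful for object recovery because it contains the most recent appearance information of the object.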
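The training and detection steps of Equations (2) to (4) can be illustrated with a minimal NumPy sketch. It assumes a single-channel patch, omits the cosine window, the HOG features and the scale pool, and normalizes the kernel distance by the number of pixels, which is a common implementation convention rather than something stated in the text. The function names and default parameter values are illustrative only.

```python
import numpy as np

def gaussian_labels(shape, width=2.0):
    """Regression targets: a Gaussian peak at index (0, 0) that wraps around the borders."""
    rows = np.minimum(np.arange(shape[0]), shape[0] - np.arange(shape[0]))
    cols = np.minimum(np.arange(shape[1]), shape[1] - np.arange(shape[1]))
    rr, cc = np.meshgrid(rows, cols, indexing="ij")
    return np.exp(-(rr ** 2 + cc ** 2) / (2.0 * width ** 2))

def gaussian_correlation(x, z, sigma):
    """Gaussian kernel between x and all cyclic shifts of z, computed with FFTs (Eq. 3)."""
    cross = np.real(np.fft.ifft2(np.conj(np.fft.fft2(x)) * np.fft.fft2(z)))
    dist = np.sum(x ** 2) + np.sum(z ** 2) - 2.0 * cross
    # Assumption: divide by the patch size so that sigma is independent of resolution.
    return np.exp(-np.maximum(dist, 0.0) / (sigma ** 2 * x.size))

def train(x, y, sigma=0.5, lam=1e-4):
    """Dual coefficients in the Fourier domain (Eq. 2)."""
    k = gaussian_correlation(x, x, sigma)
    return np.fft.fft2(y) / (np.fft.fft2(k) + lam)

def detect(alpha_f, x_model, z, sigma=0.5):
    """Spatial response map for a new patch z (Eq. 4 followed by the inverse FFT)."""
    k = gaussian_correlation(x_model, z, sigma)
    return np.real(np.fft.ifft2(alpha_f * np.fft.fft2(k)))

# Usage: a patch shifted by (3, 5) pixels yields a response peak at that offset.
rng = np.random.default_rng(0)
patch = rng.standard_normal((32, 32))
alpha_f = train(patch, gaussian_labels(patch.shape))
response = detect(alpha_f, patch, np.roll(patch, (3, 5), axis=(0, 1)))
peak = np.unravel_index(np.argmax(response), response.shape)
```

The location of the maximum response gives the estimated translation, and the value at that location is the response value that the reliability assessment relies on.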

Reliability assessment of tracking process
Accurate assessment of the tracking process is an effective guarantee for the timely intervention of the object recovery mechanism. Since correlation filter tracking locates the target according to the response map, the intensity of the response can be used to characterize the accuracy of target positioning. Figure 1 shows the fluctuation trend of the response value during object tracking. To explore a good tracking process evaluation method, we use the ground truth given by the dataset to investigate the distribution of response values at different overlap rates. The overlap is defined as $O(B_1, B_0) = \frac{|B_1 \cap B_0|}{|B_1 \cup B_0|}$, where $B_1$ and $B_0$ indicate the predicted bounding box and the ground truth, respectively. Since the tracker predicts the new position based on the response map, we can obtain not only the predicted bounding box $B_1$ but also the response value corresponding to $B_1$. For this reason, we can get the distribution of response values at different overlap rates. In the one-pass evaluation [22,24], for the tracking result of each frame, when $O(B_1, B_0) \geq 0.5$, it is considered that the tracker works well. Visually, when $O(B_1, B_0) = 0.5$, the bounding box still falls on the tracked object. However, as the overlap rate decreases, the bounding box gradually deviates from the object. When the overlap rate is about 0.4, the tracker has almost drifted, and when the overlap rate drops to about 0.2, the tracker has completely lost the object. Therefore, we can investigate the relationship between the response values and the reliability of the tracking results by examining the distribution of response values at different overlap rates.
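For reference, the overlap rate between a predicted box and the ground truth can be computed as in the short sketch below. The (x, y, width, height) box layout and the function name are assumptions made for illustration.

```python
def overlap_rate(box_pred, box_gt):
    """Intersection over union of two axis-aligned boxes given as (x, y, w, h)."""
    px, py, pw, ph = box_pred
    gx, gy, gw, gh = box_gt
    inter_w = max(0.0, min(px + pw, gx + gw) - max(px, gx))
    inter_h = max(0.0, min(py + ph, gy + gh) - max(py, gy))
    inter = inter_w * inter_h
    union = pw * ph + gw * gh - inter
    return inter / union if union > 0 else 0.0

# A box shifted by half its width overlaps the original with a rate of 1/3,
# which is below the 0.5 success threshold of the one-pass evaluation.
rate = overlap_rate((5, 0, 10, 10), (0, 0, 10, 10))
```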
Considering that correlation filter based tracking predicts the state of the object in the next frame from the result of the previous frame, the reliability of the tracking process can be regarded as being correlated back and forth on the time axis. Therefore, we observe the fluctuations of the response values $\{R_{K-4}, R_{K-3}, R_{K-2}, R_{K-1}, R_K\}$ of five consecutive frames to evaluate the reliability of the tracking results. To this end, we design two conditions, Equations (7) and (8), in which $R_K$ is the response value corresponding to the $K$th frame and the reference value is the response value of the initial frame. We regard the second frame as the initial frame because the response value of the first frame is always 1. Since the second frame reflects the state of the target in the original environment, its response value has a high reference value. The operator sum counts the number of elements in the set whose response value is less than the product of the two constraint coefficients and the initial response value. Both constraint coefficients, $d$ and a second factor, are less than 1. Equation (7) is mainly used to find the time period when the response value drops significantly, and then Equation (8) is used to further determine whether there are two or more drastically decreased response values. When both Equations (7) and (8) are satisfied, the object recovery mechanism intervenes immediately.
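The criterion can be sketched as follows. The exact forms of the two conditions are not reproduced here, so the sketch makes two loudly stated assumptions: the first condition is taken to be "the mean of the five responses falls below d times the initial response", and the second is "at least two of the five responses fall below the product of both coefficients and the initial response". The name `mu` for the second coefficient is hypothetical; the default values 0.35 and 0.7 are those selected in the hyperparameter analysis.

```python
from collections import deque

class ReliabilityMonitor:
    """Flags tracking failure from the responses of the last `window` frames."""

    def __init__(self, r_init, d=0.35, mu=0.7, window=5):
        self.r_init = r_init          # response value of the second (reference) frame
        self.d = d                    # first constraint coefficient (< 1)
        self.mu = mu                  # second constraint coefficient (< 1), hypothetical name
        self.history = deque(maxlen=window)

    def update(self, response):
        """Add the newest response value and return True if recovery should start."""
        self.history.append(response)
        if len(self.history) < self.history.maxlen:
            return False
        # ASSUMED form of the first condition: a significant drop over the window.
        mean_response = sum(self.history) / len(self.history)
        dropped = mean_response < self.d * self.r_init
        # ASSUMED form of the second condition: two or more drastically low responses.
        severe = sum(r < self.mu * self.d * self.r_init for r in self.history)
        return dropped and severe >= 2

# Usage: the response decays while the target becomes occluded.
monitor = ReliabilityMonitor(r_init=0.8)
flags = [monitor.update(r) for r in [0.7, 0.6, 0.5, 0.4, 0.3, 0.15, 0.1, 0.1]]
```

In this example the monitor stays silent while the response is merely declining and fires only once several consecutive frames show a drastically reduced response, which is the behaviour the time-axis criterion is meant to capture.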

Template pool
During the tracking process, the appearance of the object changes over time. An important issue in long-term tracking is how to enable the tracker to recover the object successfully after a tracking failure caused by significant changes in the object appearance. However, using the target sample of the initial frame as a fixed template is not well suited for long-term tracking because the fixed template cannot adapt to the appearance changes of the target during the tracking process [39]. In [60], the short-term classifier-pool uses the last M samples as templates and partially occluded objects are also used as samples, so the classifier-pool is to some extent contaminated by samples containing erroneous information. Different from the methods in [39] and [60], the availability of each template is examined through its corresponding response value when building the template pool in this work. Assuming that the object recovery mechanism intervenes at the $K$th frame, the template pool is constructed as a collection of the templates of the last 15 frames, where $T_K$ is the template of the target at the $K$th frame. According to Equations (4) and (5), the template $T_i$ with the largest response value in the pool can be determined. The template $T_i$ will be used to recover objects from tracking failures.

FIGURE 2 The schematic diagram of the local search strategy
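A template pool of this kind can be sketched with a bounded queue. The sketch assumes that each template is stored together with the response value recorded when it was produced; the class and method names are illustrative.

```python
from collections import deque

class TemplatePool:
    """Keeps the templates of the most recent frames together with their responses."""

    def __init__(self, capacity=15):
        # A deque with maxlen discards the oldest entry automatically.
        self._items = deque(maxlen=capacity)

    def add(self, template, response):
        self._items.append((response, template))

    def __len__(self):
        return len(self._items)

    def best(self):
        """Return the stored template with the largest response value."""
        if not self._items:
            raise ValueError("template pool is empty")
        return max(self._items, key=lambda item: item[0])[1]

# Usage: after 20 frames only frames 5 to 19 remain, so the high-response
# template of frame 2 has been discarded and frame 7 is selected.
pool = TemplatePool()
for frame in range(20):
    pool.add(f"T{frame}", 0.9 if frame in (2, 7) else 0.5)
selected = pool.best()
```

Selecting by response value, rather than taking the most recent template, is what keeps partially occluded samples from being used for recovery.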

Local search strategy
The boundary effect is a significant flaw in correlation filter tracking, and the search area is usually set to 1.5 times the size of the object. When part of the object moves out of the search area, the positioning of the target becomes inaccurate. In particular, fast-moving objects tend to move out of the search area easily, which eventually leads to tracking failure. Therefore, a search area of 1.5 times the object size is usually insufficient. For this reason, SRDCF [21] limits the boundary effect and expands the search area to 4 times the size of the object, but greatly sacrifices the running speed. Generally, when the object reappears after being lost, it is located in the area surrounding the position where it was lost. Therefore, the local search method can detect the object in the surrounding area of interest directly with high efficiency. In order to improve the tracking speed and accuracy, we propose a local search strategy which is activated after the conditions in Equations (7) and (8) are met. When tracking fails, the tracker records the position of the object at the current frame, and then the detector searches for the target in the local region around that position. To this end, we designed a sliding window that performs a cyclic shift in the local region for dense sampling while extracting image features to generate a sample corresponding to each window. This is equivalent to dividing the expanded search area into many blocks and performing correlation measurement on each block separately, which can effectively reduce the influence of boundary effects. The specific search strategy is shown in Figure 2. The red dot represents the center coordinate of the object's bounding box. When the criterion is triggered, the tracker records the coordinates of the red dot. The red box indicates the search region centered on the red dot, where $W_{search}$ is the width and $H_{search}$ is the height. The green box with a green dot is a sliding window that performs a cyclic shift within the search region. Let $W_K$ and $H_K$ denote the width and height of the bounding box at the $K$th frame, respectively. In our algorithm, the search region is defined in terms of $W_K$ and $H_K$: $W_{search}$ and $H_{search}$ are obtained by multiplying $W_K$ and $H_K$, respectively, by a common scale factor, which is a positive integer greater than 1, and the search region $S_{search}$ is the area spanned by $W_{search}$ and $H_{search}$. The larger the scale factor, the larger the search region $S_{search}$. The sliding step sizes of the sliding window along the $x$ and $y$ coordinate axes are then determined by the number of sample points taken along each axis. This means that the total number of samples generated throughout the local search region is the square of the number of sample points per axis.
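The layout of the sliding windows can be sketched as below. The step-size formula is an assumption: the windows are spaced evenly so that the first and last windows touch the borders of the search region. The parameter names `gamma` (scale factor) and `rho` (sample points per axis) are hypothetical, and their defaults follow the values 4 and 10 used later in the paper.

```python
def local_search_windows(cx, cy, w_k, h_k, gamma=4, rho=10):
    """Return rho * rho windows (left, top, width, height) inside the search region.

    (cx, cy) is the recorded centre of the lost object and (w_k, h_k) is the
    size of its bounding box at the frame where tracking failed.
    """
    if rho < 2:
        raise ValueError("rho must be at least 2")
    w_search, h_search = gamma * w_k, gamma * h_k
    # ASSUMED step size: spread the windows evenly across the search region.
    step_x = (w_search - w_k) / (rho - 1)
    step_y = (h_search - h_k) / (rho - 1)
    left0 = cx - w_search / 2.0
    top0 = cy - h_search / 2.0
    return [
        (left0 + i * step_x, top0 + j * step_y, w_k, h_k)
        for j in range(rho)
        for i in range(rho)
    ]

# Usage: an 80 x 80 object gives a 320 x 320 search region and 100 windows.
windows = local_search_windows(cx=100, cy=100, w_k=80, h_k=80)
```

Each window would then be passed through the correlation filter with the selected template, so the cost of the local search grows with the square of the number of sample points per axis.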

Global search strategy
In Section 4, extensive experiments demonstrate the effectiveness of the proposed local search strategy. However, local search faces a serious problem: the method fails when the target moves out of the search region $S_{search}$. One of the most straightforward ways to alleviate this flaw is to increase the scale factor of the search region and the number of sample points, but this can seriously degrade the real-time performance of the tracker. This is mainly because the number of samples increases quadratically with the number of sample points per axis, and such a computational burden is unaffordable for the tracker. Considering that some computationally efficient target detection algorithms have been applied to the object tracking field, we use EdgeBox [61] to perform global search on images. In simple terms, the core idea of EdgeBox is to calculate a score for each sliding window that represents the likelihood that the area corresponding to the sliding window contains objects. If the score is large, the area is likely to contain objects. On the contrary, if the score is small, the area is unlikely to contain objects. EdgeBox can quickly generate object bounding box proposals directly from the edges, and it typically takes only about 0.35 s to generate 1000 proposals in a 480 × 720 image. Given the advantages of EdgeBox, using this method to perform global search on images does not have a significant adverse effect on tracking speed. More details about EdgeBox can be found in [61]. As for the switch between local search and global search, we record the average displacement $\bar{D}$ of the target in the last 5 frames. Assuming that the tracking failure occurs at the $K$th frame, in order to improve real-time performance and detection efficiency, we directly skip the subsequent 10 frames before starting detection. In practice, most objects move at a roughly constant speed and the average displacement is generally in the range of 4-8 pixels, while fast-moving objects reach 20 pixels and above. On the basis of ensuring tracking accuracy and running speed, we set the local search region to 4 times the object size in width and height and the number of sampling points to 10, because overly large settings of these two parameters seriously affect the real-time performance of the tracker. Taking the general object size of 80 × 80 as a reference, the local search area is 320 × 320, and the search radius is about 160 pixels. Therefore, in order to determine whether the lost object is still within the range of 160 pixels after 10 frames, we set the average displacement boundary value to 15 pixels. If $\bar{D}$ is less than 15 pixels, local search is used; otherwise global search is used.
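The switching rule can be sketched as follows. With a boundary of 15 pixels per frame, an object moves at most about 150 pixels during the 10 skipped frames, which stays inside the 160-pixel search radius. The sketch assumes that the average displacement is the mean Euclidean distance between consecutive box centres; the function names are illustrative.

```python
import math

def average_displacement(centers):
    """Mean Euclidean distance between consecutive (x, y) centres."""
    steps = [math.dist(a, b) for a, b in zip(centers, centers[1:])]
    return sum(steps) / len(steps)

def choose_search_mode(centers, boundary=15.0, frames=5):
    """Return "local" or "global" from the displacements of the last `frames` frames."""
    recent = centers[-(frames + 1):]   # frames displacements need frames + 1 centres
    return "local" if average_displacement(recent) < boundary else "global"

# Usage: a slowly moving object (5 px per frame) versus a fast one (20 px per frame).
slow = [(5 * t, 0) for t in range(8)]
fast = [(20 * t, 0) for t in range(8)]
modes = (choose_search_mode(slow), choose_search_mode(fast))
```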
In EdgeBox, we set the threshold to accept only the top N = 200 proposals with high confidence scores. Considering the appearance variation of the objects during the tracking process, we impose a size constraint on the N proposals to improve efficiency, where $b_w$ and $b_h$ are the width and height of the proposals, respectively, $b_w^K$ and $b_h^K$ are the width and height of the object bounding box at the $K$th frame, and the allowed deviation is controlled by a constraint coefficient.
Since the value of the constraint coefficient determines the number of candidate samples, choosing an appropriate value can improve the running speed without deteriorating the tracking accuracy. We show a case in Figure 3, where the number of candidate samples $N'$ corresponding to different values of the constraint coefficient is marked in each subgraph. It can be clearly seen from Figure 3 that as the value of the constraint coefficient gradually decreases, the number of candidate samples $N'$ decreases significantly. This means that appropriately reducing the constraint coefficient can effectively reduce the computational burden in the object recovery mechanism. In order to better balance the running speed and tracking accuracy, we set the constraint coefficient to 1.2. A tighter constraint may occasionally exclude the true target when its size has changed considerably, but in most cases the trade-off is acceptable.
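The effect of the constraint coefficient can be illustrated with a small filter over proposal boxes. The form of the constraint is an assumption made for this sketch: a proposal is kept when its width and height each lie within a factor `limit` of the last known object size. The function name and the (x, y, w, h) layout are illustrative, and the proposals would come from an EdgeBox implementation that is not shown here.

```python
def filter_proposals(proposals, w_k, h_k, limit=1.2):
    """Keep proposals (x, y, w, h) whose size is close to the last object size.

    ASSUMED constraint: w_k / limit <= w <= w_k * limit, and likewise for h.
    """
    kept = []
    for x, y, w, h in proposals:
        width_ok = w_k / limit <= w <= w_k * limit
        height_ok = h_k / limit <= h <= h_k * limit
        if width_ok and height_ok:
            kept.append((x, y, w, h))
    return kept

# Usage: a smaller limit leaves fewer candidates to evaluate.
boxes = [(0, 0, 80, 80), (0, 0, 100, 80), (0, 0, 70, 90), (0, 0, 40, 40)]
tight = filter_proposals(boxes, w_k=80, h_k=80, limit=1.2)
loose = filter_proposals(boxes, w_k=80, h_k=80, limit=1.5)
```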

Target determination
As can be seen from Equations (11), (12) and (13), the local search yields a number of samples equal to the square of the number of sampling points per axis, and the global search yields $N'$ samples. Then, these samples and $T_i$ are substituted into Equations (3) and (4) to obtain a response matrix. For local search, the response matrix is denoted as $L$ and contains one sub-matrix per local sample. For global search, the response matrix is denoted as $G$ and contains $N'$ sub-matrices. Generally speaking, the object corresponding to the sample with the maximum response is most likely the target lost by the tracker at the $K$th frame. Therefore, it is only necessary to find the maximum response in $L$ and $G$, respectively. However, the response of a similar object is also relatively large, which may lead the detector to recover a wrong target. To alleviate this dilemma, we set a threshold on the time domain to constrain the maximum response value. In Table 1, experiments were conducted on 24 video sequences [22,24,25] to investigate the effect of this threshold on the object recovery mechanism. In order to observe the accuracy under different thresholds more intuitively, we averaged the accuracy over the 24 sequences for each threshold, and the average accuracy results under different thresholds are shown in Figure 4. From the results in Figure 4, the average accuracy increases monotonically as the threshold goes from 0 to 0.3, but decreases monotonically as it goes from 0.3 to 1. In general, the object recovery mechanism has optimal performance when the threshold is 0.3, and the average accuracy is improved by 0.1485 compared to

Model update
During tracking, the appearance of the object changes significantly due to factors such as rotation and deformation. Therefore, the target template should be updated appropriately during tracking to give the tracker more powerful performance. If the target template is updated too frequently, the template is easily contaminated by background information and other noise. Conversely, the target template cannot capture the normal appearance of the target in a timely manner if it is updated too slowly. Evaluating the reliability of the tracking results from a continuous time perspective is comprehensive, but may ignore the state of the target in a single frame, especially occlusion. Therefore, we compensate for this deficiency by using the PSR to estimate the occlusion of the target in each frame, so that the model update coefficient in Equation (6) can adapt to the appearance change of the target in a more timely manner.

FIGURE 5 The whole flowchart of the proposed method. Our tracker is mainly composed of five parts, including short-term correlation filter tracking module, reliability evaluation module, template pool module, object recovery module and model update module
The whole flowchart of our tracker is shown in Figure 5. The tracker is mainly composed of five parts: a short-term correlation filter tracking module, a reliability evaluation module, a template pool module, an object recovery module and a model update module. After the initial object is confirmed, the short-term tracker is used to perform basic tracking of the object, and at the same time a template pool is established to maintain the appearance model of the object. Then, the reliability evaluation criterion is used to judge the tracking result. If the tracking result is unreliable, the object recovery mechanism is activated to recover the lost object from subsequent images. Otherwise, the short-term tracker continues to perform basic tracking. Finally, the template is appropriately updated to adapt to changes in the appearance of objects during tracking.
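The per-frame occlusion check used in the model update step can be sketched with the PSR as defined for MOSSE [20], where the sidelobe is the response map excluding an 11 × 11 window around the peak. How the PSR is mapped to the update coefficient is not specified in the text, so the gating rule below, together with the values of `rate` and `psr_min`, is purely hypothetical.

```python
import numpy as np

def peak_to_sidelobe_ratio(response, exclude=5):
    """PSR = (peak - sidelobe mean) / sidelobe std, excluding a window around the peak."""
    r, c = np.unravel_index(np.argmax(response), response.shape)
    mask = np.ones(response.shape, dtype=bool)
    mask[max(0, r - exclude):r + exclude + 1, max(0, c - exclude):c + exclude + 1] = False
    sidelobe = response[mask]
    return float((response[r, c] - sidelobe.mean()) / (sidelobe.std() + 1e-12))

def update_template(old, new, psr, rate=0.02, psr_min=10.0):
    """Linear update in the style of Equation (6), skipped when the PSR is low.

    HYPOTHETICAL gating: rate and psr_min are illustrative values only.
    """
    blend = rate if psr >= psr_min else 0.0
    return (1.0 - blend) * old + blend * new

# Usage: a sharp single peak has a much higher PSR than a noisy response map.
rng = np.random.default_rng(0)
sharp = 0.01 * rng.random((32, 32))
sharp[10, 10] = 1.0
noisy = rng.random((32, 32))
psr_sharp = peak_to_sidelobe_ratio(sharp)
psr_noisy = peak_to_sidelobe_ratio(noisy)
```

Freezing the template when the PSR is low keeps an occluded frame from contaminating the model, while the continuous response criterion decides whether recovery is needed.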

Hyper parameters analysis of criterion
We conducted extensive experiments on three datasets, OTB [24], Temple-Color [25] and UAV [26], to draw general conclusions about our tracking method. In order to display the distribution of the response values more fully, we group the targets by the ratio of the target size to the image size and then use 0.005 as an interval to collect the response values of all the targets in each interval and average them. The distribution of response values is shown in Figure 6. According to the observations in Figure 6, we found that when the response value falls to the range of 30% to 40% of that of the initial frame, $O(B_1, B_0)$ is around 0.4, and when the response value further drops to 20% to 30% of that of the initial frame, $O(B_1, B_0)$ is around 0.2. Therefore, taking the values of $d$ and the second constraint coefficient as 0.35 and 0.7, respectively, gives the criterion better accuracy. To further verify the effectiveness of the proposed method, we also tested the reliability evaluation criterion in twelve typical scenarios. It can be seen intuitively from Figure 7 that when the tracker fails or drifts, the response value of the object drops significantly compared to the initial value. Besides, the fluctuation of the response value shown in Figure 7 meets the trigger conditions of Equations (7) and (8). The test results demonstrate that the proposed criterion can accurately determine whether the tracker fails in different scenarios. Since our method uses the primitive correlation filter and does not adjust the response map, the response values are distributed between 0 and 1. As long as the object is disturbed by occlusion, deformation and other factors, the response value drops significantly, which means that the proposed criterion is suitable for all datasets under the primitive correlation filter tracking framework.

Ablation study
To investigate the contribution of different components to the tracker, we modified the proposed tracker to form multiple variants. There are five main components to consider: the continuous response value criterion (CRV), the peak to sidelobe ratio criterion (PSR), the median flow criterion (MF), the local search strategy (LS) and the global search strategy (GS). It should be noted that the baseline tracker only follows the basic architecture of the SAMF tracker. We manually combine a basic correlation filter with a scale pool to form the baseline tracker instead of directly using the original SAMF tracking code. LGS denotes that LS and GS are used alternately according to D. The details of our tracker (CRV-LGS) and the other variants are presented in Table 2. For each variant, we use a red tick to indicate the components that are used and a blue cross to indicate the components that are not. The tracking results of our tracker and the other nine variants on the OTB2015 dataset are shown in Figure 8. As can be seen from the first row of Figure 8, the LGS search mode is more effective than LS and GS under different reliability evaluation criteria. This can be attributed to the fact that LGS fully considers the motion information of the target, so the proposed tracker can flexibly adopt the appropriate search method in different scenarios. It can also be seen from the first row of Figure 8 that LS is more effective than GS. This is mainly because LS only searches for the object within the area of interest, while GS searches the entire image. Therefore, compared with the GS search mode, LS can locate objects more effectively and avoid interference from similar objects in other regions. For example, as shown in Figure 3, the detector detects multiple similar faces in GS mode, which increases the probability of the detector recovering a wrong object.
From the comparison results in the second row of Figure 8, it is clear that CRV performs better than PSR and MF in different search modes. This is because PSR evaluates the response map of a single frame, and MF considers the displacement of pixels within the bounding box in two adjacent frames. Assessing the reliability of the tracking process only by metrics obtained from a single frame or two adjacent frames is not comprehensive. Since CRV considers the reliability of tracking results from a continuous time perspective, it is more comprehensive and effective than PSR and MF.
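To make the single-frame nature of PSR concrete, the following sketch computes the peak to sidelobe ratio in its commonly used form: the peak value minus the sidelobe mean, divided by the sidelobe standard deviation, where the sidelobe is the response map outside a small window around the peak. The window half-width `exclude=5` (an 11 x 11 window) is a conventional choice assumed here, and the paper's exact PSR configuration may differ.

```python
import numpy as np

def peak_to_sidelobe_ratio(response, exclude=5):
    """PSR of a 2-D correlation response map (single-frame measure).

    exclude: half-width of the square window around the peak that is
             left out of the sidelobe statistics (assumed value).
    The map is assumed to be larger than the excluded window.
    """
    response = np.asarray(response, dtype=float)
    peak_row, peak_col = np.unravel_index(np.argmax(response), response.shape)
    peak = response[peak_row, peak_col]

    # Mask out the window around the peak; the rest is the sidelobe.
    mask = np.ones(response.shape, dtype=bool)
    r0, r1 = max(peak_row - exclude, 0), peak_row + exclude + 1
    c0, c1 = max(peak_col - exclude, 0), peak_col + exclude + 1
    mask[r0:r1, c0:c1] = False

    sidelobe = response[mask]
    return float((peak - sidelobe.mean()) / (sidelobe.std() + 1e-12))
```

Because the value depends on one response map only, it carries no information about how the response has evolved over preceding frames, which is the limitation discussed above.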
In terms of overall performance, the proposed tracker outperforms the baseline tracker by 10.43% in precision rate and 6.85% in success rate. It is worth noting that, compared to the baseline tracker, the proposed tracker does not show significant speed degradation, with a drop of only 3.8 frames per second. This is because the tracker adaptively selects the appropriate detection mode and the number of proposals is set to 200, which reduces the time required for object detection. Besides, the short-term correlation filter tracker has good real-time performance, and the main time-consuming part is the recovery of the object. Therefore, when the object recovery mechanism is not enabled, the proposed tracker can be regarded as a baseline tracker. In general, tracking failure does not occur frequently, which means that the object recovery mechanism does not intervene frequently, so the real-time performance of the tracker is only slightly affected. For these reasons, although the overall running speed of the tracker decreases, the reduction is not significant. The ablation results show that each component contributes to the performance of the tracker to a different degree. In terms of the strength of the contribution, the proposed components are superior to the others in enhancing tracking performance, which demonstrates the effectiveness of the proposed method.

Comparison with state-of-the-arts
We compare our approach with 16 state-of-the-art trackers on the OTB dataset using OPE, including SANet [57], MCPF [58], SiamFC [39], ECO [43], ACFN [62], BACF [10], LMCF [63], SRDCF [21], MEEM [36], DLSSVM [64], DCFNet [65], Staple [66], SAMF [14], LCT [46], DSST [13] and KCF [18]. Among them, SANet incorporates a recurrent neural network (RNN) into a CNN, MCPF combines particle filters and correlation filters, and DLSSVM is based on the structured support vector machine method. ECO, BACF, LMCF, SRDCF, DSST, LCT, SAMF, Staple and KCF are based on the correlation filter. MEEM is developed based on regression and multiple trackers. DCFNet integrates the Siamese network and ACFN uses the attentional correlation filter network. The OPE performance comparison on OTB-2013 and OTB-2015 is shown in the first and second rows of Figure 9. The comparison shows that even though the precision rate and success rate of our tracker are not as good as those of the SANet, MCPF and ECO trackers, the proposed tracker still achieves considerable performance. In particular, the running speed of SANet and MCPF is only about 1 frame per second, which is far from meeting real-time tracking requirements. Since our tracker only uses hand-crafted features, its performance still has great potential for improvement, for example by using CNN features in our tracking framework. With the aid of the reliability assessment and object recovery mechanism, the proposed method significantly improves the performance in both distance precision (7.62%) and overlap success (5.34%) compared with the basic short-term tracker SAMF on the OTB-2015 dataset.
On the TC-128 dataset, we compared our algorithm against 11 state-of-the-art trackers, including MEEM [36], SAMF [14], Struck [31], ASLA [67], KCF [18], MIL [29], DCF-CA [68], OCT-KCF [69], FCT [16], L1APG [70] and LC-CF [71]. Among them, OCT-KCF, L1APG and DCF-CA have the ability to handle occlusion. In particular, OCT-KCF was recently developed specifically to solve the occlusion and drifting problems. The comparison results are shown in the third row of Figure 9. It can be seen that the overall performance of our tracker is the best, and its precision rate and success rate are 5.79% and 3.21% higher than those of SAMF, respectively.
Further, we also evaluated the proposed tracker on the UAV dataset and compared the tracking results with 14 state-of-the-art trackers, including DSLS [72], MEEM [36], SRDCF [21], MUSTER [73], Struck [31], SAMF [14], DSST [13], TLD [33], DCF [18], KCF [18], CSK [17], MOSSE [20], ASLA [67] and IVT [74]. DSLS is a high performance tracker based on optimized structure learning and dense image representation. In MUSTER, a dual-component approach consisting of short-term and long-term memory stores is used to process target appearance memories. IVT uses incremental PCA in a particle-based filtering framework. The tracking results are presented in the fourth and fifth rows of Figure 9. As can be seen from the comparison, our tracker has a very significant advantage in long-term tracking and is 3.1% and 0.3% higher than DSLS in terms of precision rate and success rate, respectively. However, the overall performance of our tracker on the UAV123 dataset is not as good as that of DSLS, SRDCF and MEEM. This is mainly caused by the boundary effect: DSLS and MEEM use a detection-based architecture and implement tracking through an SVM classifier, so, unlike DCF-based trackers, they are not subject to the boundary effect. As for SRDCF, the focus of this tracker is to solve the boundary effect, so its basic tracking framework is more advantageous than our method. It is worth mentioning that the objects in the UAV dataset are mostly small targets and the sampling range of the basic correlation filter tracker is related to the object size, which severely limits the search area and brings a serious boundary effect. In terms of performance gains, compared with the basic tracker SAMF, the proposed tracker improves by 11.50% in distance precision and 8.20% in overlap success.
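For readers unfamiliar with the two metrics quoted throughout this comparison, the sketch below shows how distance precision and overlap success are conventionally computed in OTB-style evaluation. The 20-pixel and 0.5 thresholds are the customary defaults and are assumptions here, since this section does not state them; benchmark success scores are also often reported as the area under the success curve over all thresholds rather than at a single threshold. Boxes are assumed to be `(x, y, width, height)` tuples.

```python
import numpy as np

def iou(box_a, box_b):
    """Overlap rate (intersection over union) of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def distance_precision(pred, gt, threshold=20.0):
    """Fraction of frames whose center location error is within threshold pixels."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    pred_centers = pred[:, :2] + pred[:, 2:] / 2.0
    gt_centers = gt[:, :2] + gt[:, 2:] / 2.0
    errors = np.linalg.norm(pred_centers - gt_centers, axis=1)
    return float(np.mean(errors <= threshold))

def overlap_success(pred, gt, threshold=0.5):
    """Fraction of frames whose overlap rate exceeds the threshold."""
    overlaps = np.array([iou(p, g) for p, g in zip(pred, gt)])
    return float(np.mean(overlaps > threshold))
```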

Attribute-based evaluation

Evaluation on OTB
In the distance precision comparison of Figure 10, our tracker ranks first on DEF and second on the three challenges of OPR, SV and OCC. In the overlap success comparison of Figure 11, our tracker ranks second on the four challenges of OPR, OCC, DEF and BC. Compared with the best-performing ECO tracker, although the performance of our method in terms of OCC, IPR and OPR is not as good as ECO, the performance difference is slight. In particular, the distance precision of our tracker on DEF is 0.92% higher than that of ECO. It can be seen from the attribute-based tracking results that our tracker has considerable performance in short-term tracking. This means that the proposed reliability assessment criterion and object recovery mechanism can effectively enhance the ability of the tracker to cope with different scenarios. In addition, the proposed tracker performs well in DEF, OCC and OPR, which also provides a strong guarantee for solving long-term tracking problems, because the target always experiences severe deformation and occlusion in long-term tracking.

Evaluation on UAV20L
We tested the proposed tracker on the UAV20L dataset to further verify the performance in long-term tracking. All sequences in the UAV dataset are annotated with 12 attributes, including aspect ratio change (ARC), background clutter (BC), camera motion (CM), fast motion (FM), full occlusion (FO), illumination variation (IV), low resolution (LR), out-of-view (OV), partial occlusion (PO), similar object (SO), scale variation (SV) and viewpoint change (VC). Figure 12 shows the success rate of the trackers under different challenging attributes. As can be seen from Figure 12, the proposed tracker has achieved considerable performance in most of the attribute scenarios. Among them, especially in the challenges of occlusion, background clutter and aspect ratio change, our tracker is far superior to other tracking algorithms. Compared with the second best tracker DSLS, the success rate of our method is 3.0%, 7.1% and 5.9% higher in the ARC, BC and FO scenarios, respectively. There are two main reasons why the proposed tracker is more competitive than other trackers. First, the valid criterion is used in the tracker to fully assess the reliability of the tracking process, which prevents the tracker from continuously tracking the wrong target. Secondly, an object recovery mechanism with resistance to similar object interference is embedded in the tracker. When the object tracking fails, the mechanism can find the most reliable target to reinitialize the tracker. Therefore, these favourable factors make the proposed tracker more robust.

FIGURE 12 The attribute-based evaluation results of the trackers on the UAV20L dataset

Qualitative evaluation on OTB, TC-128 and UAV
We visualized the tracking results on the OTB, TC-128 and UAV datasets to demonstrate the advantages of the proposed tracker more intuitively. In all of these sequences, the target experienced severe deformation and occlusion. The detailed qualitative tracking examples for our tracker and other state-of-the-art trackers are reported in Figures 13, 14 and 15. In the comparison on OTB in Figure 13, only our tracker closely followed the targets in all the sequences, while the other trackers experienced varying degrees of drift or even lost their targets. In particular, in the sequence Girl2, the target experienced severe full occlusion and appearance changes, and our tracker successfully recovered the target with the help of the reliability assessment and object recovery mechanism, whereas all the other trackers failed. As shown in Figure 14, only our tracker works well in the Busstation-ce1, Carchase1 and Face-ce sequences, while the other trackers lose their targets. It is worth noting that there are many similar faces in the sequence Face-ce, but our algorithm does not recover a wrong target, which demonstrates that the detection coefficient has a significant effect in eliminating similar interference. Regarding the sequences Motorbike-ce, Panda and Toyplane, the proposed tracker can track the target stably throughout, while most of the other trackers fail after the target is occluded. From the comparison of the long-term tracking results in Figure 15, it can be seen that the proposed tracker has significant advantages over the other trackers. In the sequences Bike, Car16 and Person7, the appearance and trajectory of the target undergo dramatic changes, and our tracker adapts well to these changes with the help of the effective model update strategy and object recovery mechanism. However, the other trackers drift or fail due to the inability to update the model in time.
As for the sequences Group1, Group2 and Group3, our tracker steadily keeps track of the target, while the other trackers have almost lost their targets. Especially in Group2 and Group3, due to occlusion, deformation and other factors, all the other trackers fail, while our tracker remains very robust. It can be seen from the visual comparison results that the proposed tracker has a strong ability to deal with occlusion and deformation of the target.

Performance comparison of other DCF-based trackers with the proposed method
To further explore the applicability of the proposed method, we integrated the reliability assessment and object recovery mechanism into the DCF, KCF and OCT-KCF trackers. The comparison results of the improved trackers are shown in Figure 16 and Table 3. It is clear that the performance of these three trackers has been greatly improved with the help of the reliability assessment and recovery mechanism. On OTB2015, the success rates of DCF-RERM, KCF-RERM and OCT-KCF-RERM are 2.68%, 4.05% and 0.67% higher than those of their original versions, respectively. In addition, the evaluation results on UAV20L also show that the proposed method contributes significantly to the performance of the trackers: the success rates of DCF-RERM, KCF-RERM and OCT-KCF-RERM increase by 12.1%, 13.2% and 3.9%, respectively. It is worth noting that the main time-consuming part comes from the object recovery process after tracking failure, which involves a relatively large amount of calculation. In particular, KCF and DCF only use the HOG feature, so they lose objects more easily during tracking, resulting in frequent intervention of the object recovery mechanism, which ultimately degrades the real-time performance significantly. Regarding the generality of the proposed method, we find that the reliability evaluation method is suitable for tracking methods using the most basic correlation filter, that is, those whose response value is distributed between 0 and 1. However, it is not suitable for trackers that adaptively adjust the response map, such as CF-CA [68] and CF-AT [75], because the adjusted target response score is no longer distributed between 0 and 1. This means that the setting of the reliability evaluation criterion needs to fully consider the structural characteristics of the tracker itself.
As for the proposed object recovery mechanism, the method is mainly to find interested candidates first, and then measure the similarity between the candidates and the template to obtain the objects with the highest confidence. This implementation idea is also adopted by most recent long-term trackers and is not limited by the structure of the DCF-based tracker. Therefore, the object recovery mechanism has better versatility than the reliability assessment criterion based on continuous response values.
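The two-stage idea described above (generate candidates, then rank them by similarity to the stored templates) can be sketched as follows. The use of cosine similarity on feature vectors, the `min_confidence` threshold and the function name `select_candidate` are assumptions made for illustration; the similarity measure actually used by the tracker is not specified in this section.

```python
import numpy as np

def select_candidate(candidate_features, template_pool, min_confidence=0.5):
    """Pick the candidate most similar to any template in the pool.

    candidate_features: array of shape (n_candidates, feature_dim).
    template_pool:      array of shape (n_templates, feature_dim).
    Returns (index, confidence); index is None when no candidate
    reaches min_confidence (hypothetical rejection threshold).
    """
    candidates = np.asarray(candidate_features, dtype=float)
    templates = np.asarray(template_pool, dtype=float)

    # Normalize rows so the dot product equals cosine similarity.
    cand_unit = candidates / (np.linalg.norm(candidates, axis=1, keepdims=True) + 1e-12)
    temp_unit = templates / (np.linalg.norm(templates, axis=1, keepdims=True) + 1e-12)

    similarity = cand_unit @ temp_unit.T   # (n_candidates, n_templates)
    confidence = similarity.max(axis=1)    # best match over the pool
    best = int(np.argmax(confidence))

    if confidence[best] < min_confidence:
        return None, float(confidence[best])
    return best, float(confidence[best])
```

Because the ranking only needs candidate and template features, a step of this kind does not depend on how the underlying short-term tracker produces its response map, which is consistent with the versatility argument made above.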

Failure cases
In the experiments, we also found a few failure cases, as shown in Figure 17. For the Skiing sequence, tracking fails mainly because of the small search area of the correlation tracking component. The small search area cannot fully cover the range over which the object moves. Once the object moves outside the search area, the tracker loses it, especially when the object is small and fast-moving.
In the Surf-ce1 sequence, the boundary effect makes the background information also have a strong response, so the tracker gradually drifts, which also causes the template to be contaminated by the background information and eventually causes the tracker to lose the target. Regarding the car2-s sequence, traditional hand-crafted features have limitations in characterizing the appearance model of the object. Therefore, when the object is highly similar to the background, the expression of the feature cannot make the object distinguishable, which causes the tracker to drift into the background.

CONCLUSION
The aim of this paper is to present a new tracker with reliability assessment and object recovery mechanisms to address the challenges of long-term tracking. The proposed tracker uses a correlation filter as a short-term tracker to ensure real-time tracking, and then embeds the reliability assessment and object recovery mechanism to enable the tracker to recover from tracking failure. Considering the appearance and motion characteristics of the target, we build a template pool to maintain the appearance model of the target, and adopt a search strategy that can be flexibly switched between local and global to detect the target. Extensive experimental results on several benchmark datasets show that our tracker achieves impressive performance in both accuracy and speed, and significantly outperforms some related state-of-the-art trackers. Since our approach uses traditional hand-crafted features, the overall performance of the proposed tracker is limited to some extent. In the future, we will explore the application of deep features to further improve the performance.