Learning adaptive spatial–temporal regularized correlation filters for visual tracking

Recently, many visual tracking methods based on correlation filters have been proposed. These methods mainly enhance tracking performance by incorporating background, spatial, or temporal information into the appearance model. This paper proposes an effective tracking method, named the adaptive spatial–temporal regularized correlation filter (ASTRCF) tracker, based on the popular adaptive spatially regularized correlation filter (ASRCF) tracker. Specifically, the continuity of the object's motion during tracking is considered by introducing a temporal regularization term into the appearance model of the ASRCF tracker, and the solution is derived with the alternating direction method of multipliers. The proposed appearance model contains a background-awareness term, a spatial regularization term, an adaptive-weight term, and a temporal regularization term. Therefore, it not only keeps the strengths of the ASRCF tracker, such as adaptively learning the background and spatial information to enhance discriminative ability, but also exploits the relation between the correlation filters of the last frame and the current frame to address complex cases such as occlusion and fast motion. Extensive experimental results on several challenging databases show that the proposed ASTRCF tracker achieves better tracking performance than some state-of-the-art trackers.


INTRODUCTION
IET Image Process. 2021;15:1773-1785. wileyonlinelibrary.com/iet-ipr

Visual tracking is a hot topic in computer vision because of its wide applications in real life. Its goal is to design an effective tracker that determines the position of an object automatically, given only its initial position in the first frame. As the tracker is usually trained on features extracted from the last frame, appearance variations of the object, such as occlusion, fast motion, motion blur, and deformation, can seriously affect the tracking results. Therefore, it remains a challenging task to design a robust and efficient tracker. Bolme et al. proposed a minimum output sum of squared error (MOSSE) tracker [1]. It learns a correlation filter from a set of extracted features using the fast Fourier transform (FFT), then predicts the position of the object in the next frame with the learned correlation filter. Since then, numerous improved trackers have been proposed based on different prior constraints. Henriques et al. exploited the circulant structure of the shifted image patches in kernel space and proposed a kernelized correlation filter (KCF) tracker [2]. Danelljan et al. proposed a discriminative scale space tracker (DSST) for visual tracking by introducing a scale regularization term in the appearance model of the MOSSE tracker. Its appearance model for learning the desired correlation filter h can be described as follows:

$$\arg\min_{\mathbf{h}} \left\| \sum_{k=1}^{K} \mathbf{h}_k * \mathbf{x}_k - \mathbf{y} \right\|_2^2 + \lambda \sum_{k=1}^{K} \left\| \mathbf{h}_k \right\|_2^2, \tag{1}$$

where x_k and h_k denote the k-th channel of the vectorized feature image x = [x_1, x_2, …, x_K] and the correlation filter h = [h_1, h_2, …, h_K], respectively; k = 1, 2, …, K; y is the desired Gaussian response; * denotes the convolution operation; and λ is a regularization parameter. In order to relieve the unwanted boundary effect on the tracking results, Danelljan et al.
proposed a spatially regularized discriminative correlation filter (SRDCF) tracker [3] by fusing spatial information into the regularization term of the appearance model of the DSST tracker. Its appearance model for learning the correlation filter can be described as follows:

$$\arg\min_{\mathbf{h}} \left\| \sum_{k=1}^{K} \mathbf{h}_k * \mathbf{x}_k - \mathbf{y} \right\|_2^2 + \sum_{k=1}^{K} \left\| \mathbf{w} \odot \mathbf{h}_k \right\|_2^2, \tag{2}$$

where w is a negative Gaussian-shaped spatial weight matrix that makes the learned filter have a high response around the center of the tracked object, and ⊙ is the element-wise multiplication.
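The closed-form frequency-domain solution underlying MOSSE-style models such as (1) can be sketched in a few lines of NumPy. This is a single-channel toy illustration under our own naming, not the paper's implementation:

```python
import numpy as np

def learn_filter(x, y, lam=0.01):
    """Closed-form single-channel correlation filter (MOSSE-style).

    x : 2-D training patch, y : desired Gaussian response, lam : regularizer.
    Solves min_h ||h * x - y||^2 + lam ||h||^2 in the Fourier domain.
    """
    X, Y = np.fft.fft2(x), np.fft.fft2(y)
    # Element-wise closed form: H = conj(X) Y / (conj(X) X + lam)
    return np.conj(X) * Y / (np.conj(X) * X + lam)

def respond(H, z):
    """Correlate a search patch z with the learned filter; the peak marks the target."""
    return np.real(np.fft.ifft2(H * np.fft.fft2(z)))

# Toy check: train on a patch whose desired response peaks at the centre;
# the response to that same patch should then peak there as well.
rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32))
yy, xx = np.mgrid[:32, :32]
y = np.exp(-((yy - 16) ** 2 + (xx - 16) ** 2) / (2 * 2.0 ** 2))
H = learn_filter(x, y)
r = respond(H, x)
peak = np.unravel_index(np.argmax(r), r.shape)
print(peak)  # expected near (16, 16)
```

The division is element-wise, which is what makes FFT-based filter learning fast enough for real-time tracking.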
Considering the robustness requirement of a tracker in complicated scenarios, Bertinetto et al. proposed the Staple tracker [4], which combines complementary cues in the ridge regression framework to speed up tracking. Mueller et al. presented a context-aware correlation filter (CACF) tracker [5] that considers the surrounding context information in the appearance model for learning the correlation filter. Meanwhile, Galoogahi et al. proposed a background-aware correlation filter (BACF) tracker that trains the correlation filter on negative training examples to mitigate the boundary effect [6].
Recently, with its powerful ability to extract high-level features, deep learning has also been introduced into visual tracking. Some researchers design effective deep networks to learn the relation between the frame and the object. For example, Zhang et al. proposed an effective convolutional network without training (CNT) tracker [7] for determining the position of the target. Although deep-network-based trackers can improve the tracking results, they cannot use very deep networks because of the real-time requirement of tracking. Therefore, some researchers combine the advantages of deep learning and correlation filters: they use pre-trained deep networks to extract high-level features as training samples for learning the correlation filter. For example, Qi et al. proposed a hedged deep tracking (HDT) tracker [8] that takes full advantage of features from different convolutional layers and uses an adaptive Hedge method to hedge several DSST trackers into a single stronger one.
With the combination of deep features and hand-crafted features, many improved CF-based trackers have been proposed. Based on the SRDCF tracker, Li et al. proposed a spatial-temporal regularized correlation filter (STRCF) tracker [9] by introducing a temporal regularization term in the appearance model of the SRDCF tracker. Its appearance model for learning the correlation filter can be described as follows:

$$\arg\min_{\mathbf{h}} \frac{1}{2}\left\| \sum_{k=1}^{K} \mathbf{h}_k * \mathbf{x}_k - \mathbf{y} \right\|_2^2 + \frac{1}{2}\sum_{k=1}^{K} \left\| \mathbf{w} \odot \mathbf{h}_k \right\|_2^2 + \frac{\mu}{2}\left\| \mathbf{h} - \mathbf{h}_{t-1} \right\|_2^2, \tag{3}$$

where x is the combination of deep features and hand-crafted features, μ is a regularization parameter, h_{t−1} denotes the correlation filter learned from the (t − 1)-th frame, and ‖h − h_{t−1}‖²₂ denotes the temporal regularization term.
Based on the BACF tracker, Yuan et al. proposed a temporal regularization background-aware correlation filter (TRBACF) tracker [10] by introducing a temporal regularization term in the appearance model of the BACF tracker to efficiently adapt to changes in the tracking scene. Subsequently, considering that an adaptive weight matrix is more appropriate for a tracker, Dai et al. proposed an adaptive spatially regularized correlation filter (ASRCF) tracker [11] by introducing adaptive spatial regularization terms in the appearance model of the BACF tracker. Its appearance model can be described as follows:

$$\arg\min_{\mathbf{h},\mathbf{w}} \frac{1}{2}\left\| \mathbf{y} - \sum_{k=1}^{K} \mathbf{x}_k * (\mathbf{P}\mathbf{h}_k) \right\|_2^2 + \frac{\lambda_1}{2}\sum_{k=1}^{K} \left\| \mathbf{w} \odot \mathbf{h}_k \right\|_2^2 + \frac{\lambda_2}{2}\left\| \mathbf{w} - \mathbf{w}_r \right\|_2^2, \tag{4}$$

where P is a diagonal binary matrix that makes the correlation filter act directly on the true foreground and background samples, and w_r is a reference weight. From the above analysis, we find that the STRCF tracker considers the spatial and temporal information, but neither the background information nor the variation of the spatial weight matrix w over the whole tracking process; the TRBACF tracker considers the background and temporal information, but neither the spatial information nor the variation of the spatial weight matrix; and the ASRCF tracker considers the background information, the spatial information, and the variation of the spatial weight matrix, but not the temporal information.
This paper proposes a novel CF-based tracker, named the adaptive spatial-temporal regularized correlation filter (ASTRCF) tracker, by introducing a temporal regularization term in the appearance model of the ASRCF tracker. Hence, the proposed ASTRCF tracker considers not only the background information, the spatial information, and the variation of the spatial weight matrix, but also the temporal information. By means of the above prior constraints, our proposed ASTRCF tracker shows good robustness in complicated scenes. The main contributions of this paper are as follows:

FIGURE 1 A flowchart of our proposed ASTRCF tracker

• We propose a novel CF-based tracker, named the adaptive spatial-temporal regularized correlation filter (ASTRCF) tracker, by introducing a temporal regularization term in the appearance model of the ASRCF tracker.
• For the proposed new appearance model, we apply the alternating direction method of multipliers (ADMM) to deduce the iterative solutions.
• The proposed ASTRCF tracker can not only take advantage of the background information, the spatial information, and the variation of the spatial weight matrix, but also exploit the relation between the last frame and the current frame. Therefore, our tracker is robust in complicated scenes.
The rest of the paper is organized as follows: Section 2 gives the details of our proposed ASTRCF tracker. Section 3 compares the results of our ASTRCF tracker with some other state-of-the-art trackers. Section 4 concludes the paper.

PROPOSED ASTRCF TRACKER
In order to enhance the robustness of the ASRCF tracker in complicated scenes, we propose a novel CF-based tracker, the ASTRCF tracker, by making use of the relation between the last correlation filter and the current correlation filter. Firstly, our ASTRCF tracker designs a new appearance model by introducing a temporal regularization term in the appearance model of the ASRCF tracker. Secondly, we apply ADMM [12] to deduce the iterative solution of the above appearance model. Meanwhile, we still use the KCF model for multi-scale searching. The flowchart of our proposed ASTRCF tracker is shown in Figure 1.

Proposed ASTRCF appearance model for learning CF
Let I_t be the t-th frame in a video sequence with a tracked target (position P_t and scale S_t); the task of our tracker is to determine the target (position P_{t+1} and scale S_{t+1}) in the subsequent frame I_{t+1} effectively. Here, the position P_t is the coordinate of the target's center, and the scale S_t is the target's bounding box with a height and a width. Since our proposed ASTRCF tracker is based on the theory of correlation filters, the first step is to collect training samples for learning the correlation filter. In order to avoid propagating sampling errors caused by tracking drift to the subsequent tracking process, and to learn some background information around the target to avoid interference from similar patches, we apply the circular shift operation on the whole t-th frame to collect the training samples.
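As a minimal illustration (ours, not the paper's code), the circular shift operation that generates the implicit training set of a CF tracker can be written with NumPy:

```python
import numpy as np

# Circular shifts of a base patch form the implicit training set used by
# CF trackers: every (dy, dx) shift acts as one virtual training sample.
def circular_samples(patch, shifts):
    """Return circularly shifted copies of `patch`, one per (dy, dx) pair."""
    return [np.roll(patch, (dy, dx), axis=(0, 1)) for dy, dx in shifts]

base = np.arange(9).reshape(3, 3)
samples = circular_samples(base, [(0, 0), (1, 0), (0, 1)])
print(samples[1])  # rows shifted down by one
```

In practice the tracker never materializes these shifts; the circulant structure lets the FFT handle all of them at once.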
To facilitate the calculation, we vectorize each channel x_k^i of a training sample and its desired response y_i, and still use the original symbols to denote them. Let x denote the matrix whose (i, j)-th block is the vector x_i^j, and use x_k to denote the k-th column of the matrix x. In order to reflect the appearance variations of an object over the whole tracking process and to use the relation between the last frame and the current frame, we propose a novel adaptive spatial-temporal regularized correlation filter appearance model for learning an effective correlation filter. Its optimization model can be described as follows:

$$\arg\min_{\mathbf{h},\mathbf{w}} \frac{1}{2}\left\| \mathbf{y} - \sum_{k=1}^{K} \mathbf{x}_k * (\mathbf{P}\mathbf{h}_k) \right\|_2^2 + \frac{\lambda_1}{2}\sum_{k=1}^{K} \left\| \mathbf{w} \odot \mathbf{h}_k \right\|_2^2 + \frac{\lambda_2}{2}\left\| \mathbf{w} - \mathbf{w}_r \right\|_2^2 + \frac{\lambda_3}{2}\left\| \mathbf{h} - \mathbf{h}_{t-1} \right\|_2^2, \tag{5}$$

where P ∈ ℝ^{M₀²N×M₀²N} is a diagonal binary matrix that makes the correlation filter act directly on the true foreground and background samples, w_r is a reference weight that makes the learned filter have a high response around the center of the tracked object, w is the spatial weight matrix to be learned, h_{t−1} is the correlation filter learned for the (t − 1)-th frame, and λ_i is a regularization parameter, i = 1, 2, 3.
In the above model (5), the first term is a ridge regression term that uses the background information. The second term emphasizes the foreground information. The third term encourages the adaptive spatial weight w to stay close to the reference weight w_r; this constraint introduces prior information on w and avoids model degradation. The last term is a temporal regularization term that makes the correlation filter h similar to the previously learned filter h_{t−1}, exploiting the continuity of the object's movement. Therefore, our proposed ASTRCF model can not only exploit the variation of the object's background and foreground effectively via the adaptive spatial weight, but also reflect the relation between the correlation filters of the last frame and the subsequent frame, which relieves the drifting problem caused by quick movement of the object.
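The four terms of model (5) can be evaluated directly. The following single-channel NumPy sketch is our simplified illustration (P applied as a binary mask, circular convolution via the FFT, all names hypothetical), not the paper's implementation:

```python
import numpy as np

# Evaluate the four terms of model (5) for a single channel.
def astrcf_objective(h, x, y, w, w_r, h_prev, P_mask, lam1, lam2, lam3):
    Ph = P_mask * h                                    # crop to true samples
    resp = np.real(np.fft.ifft2(np.fft.fft2(x) * np.fft.fft2(Ph)))
    data_term = 0.5 * np.sum((y - resp) ** 2)          # ridge regression
    spatial_term = 0.5 * lam1 * np.sum((w * h) ** 2)   # adaptive spatial reg.
    weight_term = 0.5 * lam2 * np.sum((w - w_r) ** 2)  # keep w near w_r
    temporal_term = 0.5 * lam3 * np.sum((h - h_prev) ** 2)  # temporal reg.
    return data_term + spatial_term + weight_term + temporal_term

# Toy check: with x, y, w, w_r all zero, only the temporal term contributes.
h = np.ones((4, 4))
z4 = np.zeros((4, 4))
val = astrcf_objective(h, z4, z4, z4, z4, h - 1.0, np.ones((4, 4)),
                       lam1=0.5, lam2=1.5, lam3=0.01)
print(val)  # 0.5 * 0.01 * 16 = 0.08
```

Writing the objective out this way makes it easy to see which term each ADMM subproblem below is responsible for.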
Generally, CF-based trackers learn the correlation filter and predict the position of the object in the frequency domain. Hence, we also solve the optimization problem (5) in the frequency domain. Firstly, we convert (5) into the following form with equality constraints:

$$\arg\min_{\mathbf{h},\hat{\mathbf{G}},\mathbf{w}} \frac{1}{2}\left\| \hat{\mathbf{y}} - \sum_{k=1}^{K} \hat{\mathbf{x}}_k \odot \hat{\mathbf{g}}_k \right\|_2^2 + \frac{\lambda_1}{2}\sum_{k=1}^{K} \left\| \mathbf{w} \odot \mathbf{h}_k \right\|_2^2 + \frac{\lambda_2}{2}\left\| \mathbf{w} - \mathbf{w}_r \right\|_2^2 + \frac{\lambda_3}{2}\left\| \mathbf{h} - \mathbf{h}_{t-1} \right\|_2^2, \quad \text{s.t. } \hat{\mathbf{g}}_k = \sqrt{M_0^2 N}\,\mathbf{F}\mathbf{P}^{\top}\mathbf{h}_k, \tag{6}$$

where Ĝ = [ĝ_1, ĝ_2, …, ĝ_K] is an auxiliary variable matrix, and F ∈ ℝ^{M₀²N×M₀²N} is the discrete Fourier matrix corresponding to the Fourier transform on ℝ^{M₀²N}. For the bi-convex model (6), we apply ADMM [12] to obtain its local optimal solution. The augmented Lagrangian function for problem (6) is

$$\mathcal{L}(\mathbf{h},\hat{\mathbf{G}},\mathbf{w},\hat{\mathbf{U}}) = E(\mathbf{h},\hat{\mathbf{G}},\mathbf{w}) + \sum_{k=1}^{K} \hat{\mathbf{u}}_k^{\top}\!\left(\hat{\mathbf{g}}_k - \sqrt{M_0^2 N}\,\mathbf{F}\mathbf{P}^{\top}\mathbf{h}_k\right) + \frac{\gamma}{2}\sum_{k=1}^{K}\left\| \hat{\mathbf{g}}_k - \sqrt{M_0^2 N}\,\mathbf{F}\mathbf{P}^{\top}\mathbf{h}_k \right\|_2^2, \tag{7}$$

where E(h, Ĝ, w) denotes the objective in (6), Û = [û_1, û_2, …, û_K] is the Lagrange multiplier, and γ is the penalty parameter. With the scaled multiplier, (7) is equivalent to the following form:

$$\mathcal{L}(\mathbf{h},\hat{\mathbf{G}},\mathbf{w},\hat{\mathbf{U}}) = E(\mathbf{h},\hat{\mathbf{G}},\mathbf{w}) + \frac{\gamma}{2}\sum_{k=1}^{K}\left\| \hat{\mathbf{g}}_k - \sqrt{M_0^2 N}\,\mathbf{F}\mathbf{P}^{\top}\mathbf{h}_k + \frac{\hat{\mathbf{u}}_k}{\gamma} \right\|_2^2 + \text{const}. \tag{8}$$

Now we apply ADMM to solve the above optimization problem (8). Using ADMM, problem (8) is converted into the following three subproblems:

$$\begin{cases} \mathbf{h}^{(j+1)} = \arg\min_{\mathbf{h}}\; \frac{\lambda_1}{2}\sum_{k}\|\mathbf{w}\odot\mathbf{h}_k\|_2^2 + \frac{\lambda_3}{2}\|\mathbf{h}-\mathbf{h}_{t-1}\|_2^2 + \frac{\gamma}{2}\sum_{k}\left\|\hat{\mathbf{g}}_k - \sqrt{M_0^2 N}\,\mathbf{F}\mathbf{P}^{\top}\mathbf{h}_k + \frac{\hat{\mathbf{u}}_k}{\gamma}\right\|_2^2, \\[4pt] \hat{\mathbf{G}}^{(j+1)} = \arg\min_{\hat{\mathbf{G}}}\; \frac{1}{2}\left\|\hat{\mathbf{y}} - \sum_{k}\hat{\mathbf{x}}_k\odot\hat{\mathbf{g}}_k\right\|_2^2 + \frac{\gamma}{2}\sum_{k}\left\|\hat{\mathbf{g}}_k - \sqrt{M_0^2 N}\,\mathbf{F}\mathbf{P}^{\top}\mathbf{h}_k + \frac{\hat{\mathbf{u}}_k}{\gamma}\right\|_2^2, \\[4pt] \mathbf{w}^{(j+1)} = \arg\min_{\mathbf{w}}\; \frac{\lambda_1}{2}\sum_{k}\|\mathbf{w}\odot\mathbf{h}_k\|_2^2 + \frac{\lambda_2}{2}\|\mathbf{w}-\mathbf{w}_r\|_2^2. \end{cases} \tag{9}$$

Now we give the concrete steps for solving the above subproblems.
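The structure of the ADMM iteration — alternating closed-form updates followed by a multiplier ascent step — can be illustrated on a toy problem with the same pattern. This is our illustrative example, not the tracker itself:

```python
import numpy as np

# ADMM on a toy problem with the same structure as (6): two coupled
# quadratic terms linked by an equality constraint,
#   min_{x,z} 1/2 ||x - a||^2 + 1/2 ||z - b||^2   s.t.  x = z,
# whose solution is x = z = (a + b) / 2.
def admm_toy(a, b, gamma=1.0, n_iters=50):
    x = np.zeros_like(a)
    z = np.zeros_like(b)
    u = np.zeros_like(a)                             # scaled multiplier
    for _ in range(n_iters):
        x = (a + gamma * (z - u)) / (1.0 + gamma)    # x-subproblem
        z = (b + gamma * (x + u)) / (1.0 + gamma)    # z-subproblem
        u = u + x - z                                # multiplier (dual) ascent
    return x, z

a = np.array([1.0, 3.0])
b = np.array([3.0, 1.0])
x, z = admm_toy(a, b)
print(x)  # both entries converge to (a + b) / 2 = [2, 2]
```

Each subproblem in (9) plays the role of the x- and z-updates here: every update has a closed form, and the multiplier step enforces the coupling constraint.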
Subproblem h: For k = 1, 2, …, K, when Ĝ, w, and Û are given, the optimal correlation filter for the k-th channel h_k can be computed as follows:

$$\mathbf{h}_k^{(j+1)} = \left(\lambda_1 \mathbf{W}^{\top}\mathbf{W} + \lambda_3 \mathbf{I} + \gamma M_0^2 N\,\mathbf{P}\right)^{-1}\left(\lambda_3 \mathbf{h}_{t-1,k} + \sqrt{M_0^2 N}\,\mathbf{P}\mathbf{F}^{\top}\left(\hat{\mathbf{u}}_k + \gamma\,\hat{\mathbf{g}}_k\right)\right), \tag{10}$$

where W^{(j)} = diag(w^{(j)}) ∈ ℝ^{M₀²N×M₀²N} denotes the diagonal matrix and I is the identity matrix of order M₀²N.

Subproblem Ĝ: When h_k (k = 1, 2, …, K), w, and Û are given, the optimal Ĝ in formula (9) is

$$\hat{\mathbf{G}}^{(j+1)} = \arg\min_{\hat{\mathbf{G}}}\; \frac{1}{2}\left\|\hat{\mathbf{y}} - \sum_{k=1}^{K}\hat{\mathbf{x}}_k\odot\hat{\mathbf{g}}_k\right\|_2^2 + \frac{\gamma}{2}\sum_{k=1}^{K}\left\|\hat{\mathbf{g}}_k - \hat{\mathbf{h}}_k + \frac{\hat{\mathbf{u}}_k}{\gamma}\right\|_2^2, \tag{11}$$

where ĥ_k = √(M₀²N) F P^⊤ h_k. Due to its high computational complexity, it is difficult to optimize formula (11) directly. Hence, we solve this problem pixel by pixel, and rewrite (11) as follows:

$$\mathcal{V}_i(\hat{\mathbf{G}})^{(j+1)} = \arg\min_{\mathcal{V}_i(\hat{\mathbf{G}})}\; \frac{1}{2}\left\|\hat{y}_i - \mathcal{V}_i(\hat{\mathbf{x}})^{\top}\mathcal{V}_i(\hat{\mathbf{G}})\right\|_2^2 + \frac{\gamma}{2}\left\|\mathcal{V}_i(\hat{\mathbf{G}}) - \mathcal{V}_i(\hat{\mathbf{H}}) + \frac{\mathcal{V}_i(\hat{\mathbf{U}})}{\gamma}\right\|_2^2, \tag{12}$$

where V_i(Ĝ) ∈ ℝ^K denotes the vector composed of the i-th row of the matrix Ĝ. Then, by the Sherman–Morrison formula, the solution of problem (12) is

$$\mathcal{V}_i(\hat{\mathbf{G}})^{(j+1)} = \frac{1}{\gamma}\left(\mathbf{I} - \frac{\mathcal{V}_i(\hat{\mathbf{x}})\,\mathcal{V}_i(\hat{\mathbf{x}})^{\top}}{\gamma + \mathcal{V}_i(\hat{\mathbf{x}})^{\top}\mathcal{V}_i(\hat{\mathbf{x}})}\right)\boldsymbol{\rho}_i, \qquad \boldsymbol{\rho}_i = \hat{y}_i\,\mathcal{V}_i(\hat{\mathbf{x}}) + \gamma\,\mathcal{V}_i(\hat{\mathbf{H}}) - \mathcal{V}_i(\hat{\mathbf{U}}). \tag{13}$$
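The rank-one structure exploited in the per-pixel solution (13) can be checked numerically. The NumPy sketch below (real-valued for simplicity, whereas the paper's frequency-domain vectors are complex; all names are ours) compares the Sherman–Morrison form against a direct solve:

```python
import numpy as np

# Per-pixel G-subproblem: V = argmin 1/2 (y_i - x_i^T V)^2 + gamma/2 ||V - p||^2,
# i.e. V = (x_i x_i^T + gamma I)^{-1} (y_i x_i + gamma p). Because x_i x_i^T is
# rank one, Sherman-Morrison yields the solution in O(K) instead of O(K^3).
def solve_pixel_sm(x_i, y_i, p, gamma):
    q = y_i * x_i + gamma * p
    # (x x^T + gamma I)^{-1} q = (q - x (x^T q) / (gamma + x^T x)) / gamma
    return (q - x_i * (x_i @ q) / (gamma + x_i @ x_i)) / gamma

rng = np.random.default_rng(1)
K = 8
x_i = rng.standard_normal(K)
p = rng.standard_normal(K)
y_i, gamma = 0.7, 10.0
v_fast = solve_pixel_sm(x_i, y_i, p, gamma)
v_direct = np.linalg.solve(np.outer(x_i, x_i) + gamma * np.eye(K),
                           y_i * x_i + gamma * p)
print(np.allclose(v_fast, v_direct))  # True
```

This O(K) shortcut is what makes solving one small system per pixel tractable at video rate.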
Subproblem w: If h is fixed, the solution of w can be computed by

$$\mathbf{w}^{(j+1)} = \left(\lambda_1\sum_{k=1}^{K}\mathbf{H}_k^{\top}\mathbf{H}_k + \lambda_2\mathbf{I}\right)^{-1}\lambda_2\,\mathbf{w}_r, \tag{14}$$

where H_k = diag(h_k). In this paper, the initial reference weight w_r is taken as the negative Gaussian-shaped spatial weight matrix, and is updated by w^{(j)}. For the Lagrange multipliers Û, we use the following update:

$$\hat{\mathbf{U}}^{(j+1)} = \hat{\mathbf{U}}^{(j)} + \gamma\left(\hat{\mathbf{G}}^{(j+1)} - \hat{\mathbf{H}}^{(j+1)}\right), \tag{15}$$

where Û^{(j+1)} is the Fourier transform of U^{(j+1)}.
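Because every matrix in the w-subproblem is diagonal, the update reduces to an element-wise division: the weight shrinks wherever the filter energy is large. A small NumPy sketch (illustrative shapes and names, K channels stacked on axis 0):

```python
import numpy as np

# Element-wise closed form of the w-subproblem:
#   w(i) = lam2 * w_r(i) / (lam1 * sum_k h_k(i)^2 + lam2)
def update_w(h, w_r, lam1, lam2):
    filter_energy = np.sum(h ** 2, axis=0)   # sum over the K channels
    return lam2 * w_r / (lam1 * filter_energy + lam2)

h = np.zeros((2, 4))
h[:, 0] = 3.0                                # filter energy only at position 0
w_r = np.ones(4)
w = update_w(h, w_r, lam1=0.5, lam2=1.5)
print(w)  # suppressed at position 0, equal to w_r elsewhere
```

With lam1 = 0.5 and lam2 = 1.5 the first entry becomes 1.5 / (0.5·18 + 1.5) = 1/7, illustrating how the adaptive weight penalizes high-energy filter positions.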
Up to now, the optimization problem (5) can be solved by iterating the above four steps. The position of the tracked object can then be obtained in the Fourier domain by

$$\hat{\mathbf{r}}_1 = \sum_{k=1}^{K} \hat{\mathbf{x}}_k \odot \hat{\mathbf{g}}_k, \tag{16}$$

where r_1 is the response map and r̂_1 is its Fourier transform.
With the obtained response map, the location of the object can be determined by the maximum response value.
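The localization step can be sketched as follows (a toy NumPy illustration with hypothetical names; practical trackers additionally apply cosine windows and sub-pixel refinement):

```python
import numpy as np

# The response map is the inverse FFT of the element-wise product of the
# feature spectrum and the filter spectrum, summed over channels; the
# target displacement is the location of the maximum.
def locate(x_hats, g_hats):
    r_hat = np.sum(x_hats * g_hats, axis=0)        # sum over K channels
    r = np.real(np.fft.ifft2(r_hat))
    dy, dx = np.unravel_index(np.argmax(r), r.shape)
    h, w = r.shape
    # Wrap displacements larger than half the patch to negative offsets.
    if dy > h // 2:
        dy -= h
    if dx > w // 2:
        dx -= w
    return dy, dx

# Delta-function toy: a filter peaked at (2, 3) should report that offset.
x = np.zeros((8, 8)); x[0, 0] = 1.0
g = np.zeros((8, 8)); g[2, 3] = 1.0
x_hats = np.fft.fft2(x)[None]
g_hats = np.fft.fft2(g)[None]
print(locate(x_hats, g_hats))  # (2, 3)
```

The wrap-around correction matters because the circular convolution places negative displacements at the far end of the response map.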

KCF model for scale
In this subsection, we briefly explain the KCF model for learning the correlation filter g_t for scale estimation. For each training sample, let z be the matrix whose (i, j)-th block is the vector z_i^j, and use z_ℓ to denote the ℓ-th column of the matrix z, ℓ = 1, 2, …, L. Put y = (y_1^⊤, y_2^⊤, …, y_N^⊤)^⊤. During the tracking process, we apply the KCF model as follows to learn the scale correlation filter g_t:

$$\arg\min_{\mathbf{g}_t} \left\| \mathbf{y} - \mathbf{z} * \mathbf{g}_t \right\|_2^2 + \lambda\left\|\mathbf{g}_t\right\|_2^2. \tag{17}$$

For the above optimization (17), a closed-form solution in the primal domain is given by

$$\hat{\mathbf{g}}_t = \frac{\hat{\mathbf{z}}^{*} \odot \hat{\mathbf{y}}}{\hat{\mathbf{z}}^{*} \odot \hat{\mathbf{z}} + \lambda}, \tag{18}$$

where ^ denotes the Fourier transform and the superscript * the complex conjugate. During the tracking process, we apply this correlation filter g_t to five scale search regions and obtain their corresponding response maps. The best scale is then determined by the maximum score over the five response maps.
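Equations (17) and (18) amount to an element-wise division in the Fourier domain, after which each candidate scale is scored by its peak response. The sketch below is our single-channel illustration with hypothetical parameters, not the paper's code:

```python
import numpy as np

# Closed-form ridge-regression filter in the Fourier domain:
#   g_hat = conj(z_hat) * y_hat / (conj(z_hat) * z_hat + lam)
def learn_scale_filter(z, y, lam=0.01):
    Z, Y = np.fft.fft2(z), np.fft.fft2(y)
    return np.conj(Z) * Y / (np.conj(Z) * Z + lam)

def best_scale(G, patches):
    """Score each candidate patch by its peak response; return the winner."""
    scores = [np.max(np.real(np.fft.ifft2(G * np.fft.fft2(p))))
              for p in patches]
    return int(np.argmax(scores))

rng = np.random.default_rng(2)
z = rng.standard_normal((16, 16))
yy, xx = np.mgrid[:16, :16]
y = np.exp(-((yy - 8) ** 2 + (xx - 8) ** 2) / 8.0)
G = learn_scale_filter(z, y)
candidates = [np.zeros_like(z), z, 0.5 * z]   # three toy "scales"
print(best_scale(G, candidates))  # 1: the matching patch scores highest
```

Since the response is linear in the patch, the candidate that best matches the training appearance produces the highest peak, which is exactly how the best of the five scales is selected.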
EXPERIMENTAL RESULTS

Databases: This paper uses four popular video databases in computer vision for experimental comparison: OTB50, OTB100, DTB70, and TC128. Here, the OTB50 database contains 50 and the OTB100 database contains 100 challenging video sequences, the DTB70 database consists of 70 challenging color video sequences, and the TC128 database contains 128 challenging video sequences. These video sequences cover 11 different attributes: low resolution (LR), in-plane rotation (IPR), out-of-plane rotation (OPR), scale variation (SV), occlusion (OCC), deformation (DEF), background clutter (BC), illumination variation (IV), motion blur (MB), fast motion (FM), and out of view (OV). The one-pass evaluation (OPE) is employed to evaluate the trackers on two criteria: overlap success and distance precision.
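The two OPE criteria can be computed per frame as follows (a standard formulation with hypothetical helper names, not tied to any particular benchmark toolkit):

```python
# Overlap success counts frames whose predicted box has IoU above a
# threshold; distance precision counts frames whose centre-location error
# is below a pixel threshold (20 px is the conventional choice).
def iou(a, b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    return inter / (a[2] * a[3] + b[2] * b[3] - inter)

def centre_error(a, b):
    """Euclidean distance between the two box centres."""
    ca = (a[0] + a[2] / 2, a[1] + a[3] / 2)
    cb = (b[0] + b[2] / 2, b[1] + b[3] / 2)
    return ((ca[0] - cb[0]) ** 2 + (ca[1] - cb[1]) ** 2) ** 0.5

pred = (0, 0, 10, 10)
gt = (5, 0, 10, 10)            # half-overlapping ground-truth box
print(iou(pred, gt))           # 50 / 150 = 1/3
print(centre_error(pred, gt))  # 5.0
```

Sweeping the IoU threshold from 0 to 1 yields the success plot; sweeping the pixel threshold yields the precision plot.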
Parameter settings: In the following experiments, the parameters of our proposed ASTRCF tracker are set as follows: the number of scales is 5, the scale step is 1.01, the number of ADMM iterations is 2, the penalty parameter γ = 10, and the learning rate is 0.02. With λ₁ = 0.5, λ₂ = 1.5, and λ₃ = 0.01, our tracker achieves the best performance on the OTB50, OTB100, DTB70, and TC128 databases.
All the experiments are carried out in MATLAB 2015b on a computer with an Intel Xeon E5-1620 v3 CPU @ 3.50 GHz.

Overall performances

Table 1 shows the success rate and precision rate of our proposed ASTRCF tracker and 10 state-of-the-art trackers on the OTB50, OTB100, DTB70, and TC128 databases, and Figures 2, 3, 4, and 5 are the corresponding success plots and precision plots of the trackers in Table 1.
As observed from Table 1 and Figures 2–5, our proposed ASTRCF tracker achieves the best success rate and precision rate among the 11 trackers on the OTB50, OTB100, DTB70, and TC128 databases. For example, on the OTB50 database, the success rate and precision rate of our proposed ASTRCF tracker reach 0.669 and 0.824, respectively, which are higher than those of the second-best tracker, ASRCF, by 0.006 and 0.015, and of the third-best tracker, STRCF, by 0.052 and 0.063, respectively. On the TC128 database, the success rate and precision rate of our ASTRCF tracker reach 0.461 and 0.624, respectively, which are higher than those of the second-best tracker, ASRCF, by 0.013 and 0.021, respectively. The main reason is that the temporal regularization in our tracker learns a more appropriate correlation filter by making full use of the information in the previous correlation filter.

Attribute-based performances
In this subsection, we examine the performance of our proposed ASTRCF tracker on the 11 attributes of the video sequences. Table 2 compares the success rate on the 11 attributes of our proposed ASTRCF tracker with KCF [2], SRDCF [3], BACF [6], CNT [7], HDT [8], STRCF [9], ASRCF [11], CT [17], IVT [18], and DFT [19] on the OTB100 database. As observed from Table 2, the top three success rates on the 11 attributes are achieved by our ASTRCF, ASRCF, and STRCF trackers, in that order. The reason is that our proposed ASTRCF tracker considers the background information, spatial information, adaptive weight, and temporal information simultaneously.
In detail, our proposed ASTRCF tracker ranks first on nine attributes, and ranks second on the remaining two attributes: OV and DEF. On the OV attribute, the value of our ASTRCF tracker is 0.011 lower than that of the ASRCF tracker. The reason is the temporal regularization term in the appearance model of our ASTRCF tracker, which the ASRCF tracker does not have: the OV attribute implies a huge variation of the object between adjacent frames, so the temporal regularization may hurt the tracking result. Table 3 compares the precision rate on the 11 attributes of our proposed ASTRCF tracker with KCF [2], SRDCF [3], BACF [6], CNT [7], HDT [8], STRCF [9], ASRCF [11], CT [17], IVT [18], and DFT [19] on the OTB100 database.
As observed from Table 3, the top three precision rates on the 11 attributes are achieved by our ASTRCF, ASRCF, and STRCF trackers, in that order. The reason is that our proposed ASTRCF tracker considers the background information, spatial information, adaptive weight, and temporal information simultaneously. In detail, our proposed ASTRCF tracker ranks first on 10 attributes, and ranks second on the remaining OV attribute. The reason is that video sequences with the OV attribute have drastic variations of the objects' appearances between adjacent frames, so the temporal regularization in our ASTRCF tracker may disturb the tracking results.

Qualitative evaluation
In this subsection, we perform a qualitative evaluation of our proposed ASTRCF tracker against three outstanding trackers on five attributes: background clutter, deformation, fast motion, occlusion, and scale variation.

Background clutters (BC)

Figure 6 shows several tracking results of our proposed ASTRCF tracker together with the HDT, BACF, and STRCF trackers on nine frames of three challenging video sequences with background clutters. In these frames, the colors of the backgrounds around the targets are similar to the targets throughout. For example, the background and foreground of the "Soccer" sequence are extremely similar: the man wearing the red T-shirt is surrounded by red flags (e.g. #65, #137, and #214) and some occlusions (e.g. the golden cup). As observed from Figure 6, only our ASTRCF tracker (red box) keeps grasping the target accurately. Although the STRCF tracker can also locate the target, its localization precision is lower than ours, while the BACF and HDT trackers exhibit different degrees of tracking drift. Concretely, in the "Football" sequence, there are some partial occlusions (e.g. #72 and #167) and multiple clutters in the background (e.g. #359). Although our ASTRCF tracker and the STRCF tracker both perform well under slight occlusion, only our tracker can track the target well in the case of background clutters, while the BACF and HDT trackers clearly drift into the background. In the "Dudek" sequence, the cluttered background has a slight influence on target tracking, especially in frames #744 and #923. Obviously, our tracking results are more accurate than those of the other three trackers.

Deformation (DEF)

Figure 7 shows several tracking results of our proposed ASTRCF tracker together with the HDT, BACF, and STRCF trackers on nine frames of three challenging video sequences with deformation. In these frames, the appearances of the targets deform throughout. For example, in the "MountainBike6" sequence, the athlete constantly deforms while riding (e.g. #21, #51, and #80). In the "Bird1" sequence, the birds deform heavily while flying through the thick clouds, and some other birds act as background clutter; hence, deformation and occlusion are two important attributes of this sequence (e.g. #91, #186, and #278). As observed from Figure 7, only our ASTRCF tracker (red box) keeps grasping the target accurately. In the "MotorRolling" sequence, the motor's movement changes dramatically within three frames. Even though the target is closely surrounded by clutter, our tracker (red box) still succeeds in tracking the target accurately, while the HDT (blue box), BACF (black box), and STRCF (green box) trackers fail to locate it (e.g. #26 and #121). In the "MountainBike6" sequence, our ASTRCF tracker does not lose the target no matter how the bike's movement changes. In the "Bird1" sequence, even when interfered with by deformation and occlusion, our proposed ASTRCF tracker still tracks the target accurately, while the HDT (blue box), BACF (black box), and STRCF (green box) trackers sometimes fail to locate it.

Fast motion (FM)
In order to further evaluate the performance of our ASTRCF tracker in handling targets with fast motion, we compare our proposed ASTRCF tracker with the HDT, BACF, and STRCF trackers on nine frames of three challenging video sequences with fast motion. The tracking results are shown in Figure 8. As observed from Figure 8, only our ASTRCF tracker (red box) keeps grasping the target accurately. In the "Surfing12" sequence, the ship moves fast, which leads to fast motion of the target (e.g. #21, #63, and #109). Our ASTRCF tracker tracks the target accurately, while the HDT, BACF, and STRCF trackers exhibit tracking drift (e.g. #63 and #109). In the "Surfing03" sequence, besides fast motion, in-plane rotation (e.g. #39) and scale variation (e.g. #64 and #111) also occur. Obviously, our proposed ASTRCF tracker can still track the target accurately, while the HDT, BACF, and STRCF trackers exhibit tracking drift (e.g. #64 and #111). In the "DragonBaby" sequence, the baby moves so fast that motion blur occurs (e.g. #50). In this case, our proposed ASTRCF tracker still grasps the baby's face accurately, while the HDT, BACF, and STRCF trackers exhibit tracking drift (e.g. #50 and #93).

Occlusion (OCC)
Generally, the object is easily affected by heavy occlusion. Figure 9 shows the tracking results of our proposed ASTRCF tracker together with the HDT, BACF, and STRCF trackers on three challenging video sequences with occlusion. As observed from Figure 9, only our ASTRCF tracker (red box) keeps grasping the target accurately. In the "Sup5" sequence, even in the case of background disturbance, our tracker still locks onto the target precisely (e.g. #24 and #140), while the other three trackers, especially the HDT and BACF trackers, exhibit tracking drift. In the "Rcccar4" sequence, only our tracker tracks the target well (e.g. #34 and #80), while the other three trackers obviously lose the target. In the "Human3" sequence, the target is constantly occluded by different objects and other pedestrians. Our tracker and the HDT tracker grasp the target well, while the BACF and STRCF trackers obviously lose the target.

Scale variation (SV)
Generally, the target often suffers from scale variation during tracking. Figure 10 shows the comparison results of our proposed ASTRCF tracker with the HDT, BACF, and STRCF trackers on nine frames of three challenging video sequences with scale variation. As observed from Figure 10, our ASTRCF tracker (red box) always keeps grasping the targets accurately. For example, in the "Bicycle1" sequence, even when the target undergoes scale variation, our tracker (red box) and the BACF tracker (black box) still succeed in tracking the target, while the HDT (blue box) and STRCF (green box) trackers exhibit tracking drift (e.g. #145 and #215). In the "Bicycle2" sequence, our ASTRCF tracker does not lose the target no matter how the bicycle's scale changes. Furthermore, the scale of our tracker's bounding box adjusts according to the object's size.

CONCLUSIONS
This paper has proposed an ASTRCF tracker for visual tracking. The proposed tracker designs an improved appearance model for learning the correlation filter, based on the ASRCF tracker and regularization theory. The appearance model contains a background-awareness term, an adaptive spatial regularization term, and a temporal regularization term. Therefore, it can not only take full advantage of the background information, the spatial information, and the adaptivity of the weight matrix, but also use the temporal information to optimize the correlation filter. The designed appearance model effectively improves the robustness of our method against occlusion, large appearance deformation, fast motion, and background clutter. Extensive experimental results illustrate that our proposed ASTRCF tracker is superior to several state-of-the-art trackers in terms of accuracy and robustness.