Kernel correlation filter tracking strategy based on adaptive fusion response map

Funding information: Tianjin natural science foundation, Grant/Award Numbers: 18JCYBJC88300, 18JCYBJC88400; National natural science foundation of China, Grant/Award Number: 61873186; Tianjin high school innovation team training program, Grant/Award Number: TD13-5036

Abstract: Aiming at the problem that the tracking performance of the traditional kernel correlation filter tracking algorithm is easily affected by illumination variation, occlusion and motion blur, an improved tracking strategy is proposed. A new Histogram of Hue Gradient (HHG) feature is designed, and a new HOG-HHG feature is obtained by connecting the HOG and HHG features in series. Two features, CN and HOG-HHG, are extracted, and two kernel correlation filter classifiers are constructed based on them to establish the corresponding response maps of the tracking scene. The response maps are fused adaptively to improve the tracking robustness to complex situations in the tracking process. The updating strategy of the target model is designed based on the peak sidelobe ratio (PSR) and its difference, and adaptive thresholds are used to improve the stability of the target model. Simulation results show that the proposed method adapts better to illumination variation, occlusion and motion blur, and that both the precision and the success rate are enhanced.

the manual labelling problem; however, it demands a very powerful platform and its tracking accuracy will be degraded.
Generally, the tracking accuracy of the methods based on deep learning can be greatly improved, but the large computation cost, or the workload of labelling samples, constrains the application of these tracking strategies in the real-time detection and tracking field [13][14][15][16][17][18][19][20][21][22][23][24][25]. In addition, these tracking strategies are not suitable for tracking a target that lies outside the training sample set. On the contrary, in many engineering fields the correlation filtering methods are still popular because of their low complexity, fast operation speed and real-time performance [26][27][28][29][30][31]. For instance, the Minimum Output Sum of Squared Error (MOSSE) method was the earliest application of correlation filtering in the tracking field. In MOSSE, only one sample image of the target area is used to train the target model in the current frame, the similarity of the candidate area in the next frame is calculated with the trained target model, and the discrete Fourier transform is used to move the similarity computation into the frequency domain to accelerate the computing speed [32][33][34][35][36][37][38]. However, its tracking accuracy needs to be improved further. Ridge regression is introduced in the circulant structure kernels (CSK) tracking method, and cyclic shifts are used to perform dense sampling. However, only a single grayscale feature is used in the CSK method, so the feature description of the target is insufficient [39][40][41][42][43][44][45]. Based on the CSK method, the kernel correlation filter (KCF) method is proposed by replacing the grayscale feature with the Histogram of Oriented Gradient (HOG) feature and extending the single channel into multiple channels to improve the tracking performance. However, KCF still lacks adaptability to occlusion, motion blur and illumination variation because only the single HOG feature is used [46][47].
Multi-feature response map fusion is an effective way to improve the adaptability of a tracking method. For instance, in the Staple method [48], two response maps based on the colour histogram feature and the HOG feature are fused with fixed weights. However, fixed fusion weights cannot adapt to different tracking scenes [48][49][50].
To address this, an improved kernel correlation filter tracking method is proposed. Since the HOG feature depends strongly on the spatial layout of the tracked target, it is notoriously sensitive to deformation. Therefore, a new HHG feature is established, and a new joint HOG-HHG feature is constructed by connecting HOG and HHG to complement the insufficiency of the HOG feature. Two classifiers are trained separately with the CN feature and the joint feature, and the two corresponding feature response maps obtained by the classifiers are adaptively fused to determine the final position of the target. Furthermore, in order to improve the stability of the target model, the peak sidelobe ratio (PSR) is used to measure the disturbed degree of the target model, and the difference of PSR is used to measure its change rate. A new model update criterion is designed according to PSR and its difference to improve the tracking performance.

IMPROVED KCF TRACKING METHOD
Feature selection is a key technology in a tracking algorithm. The HOG feature is used to train the classifier in the traditional KCF tracking method and is robust to illumination variation. However, the texture of the target becomes unclear under motion blur, serious illumination variation and occlusion, so the single HOG feature cannot express the effective information of the target. Therefore, the hue feature of the HSV space and the CN feature can be used as complementary features to improve the accuracy of the target model. In this paper, the hue and its gradient are used to construct a new Histogram of Hue Gradient (HHG) feature, which is designed in the same way as HOG. The HHG feature is inherently robust to deformation and contains the colour information of the target, so HHG and HOG are connected in series to form a new HOG-HHG feature that is robust to both illumination variation and deformation. However, more detailed colour information of the target cannot be expressed by the HOG-HHG feature alone, so the CN feature is also used as an independent auxiliary feature to enhance the robustness of the tracking method. Furthermore, in order to improve the stability of the target model, a new model updating criterion based on PSR and its difference is designed. Thus, the improved target tracking method has good adaptability to illumination variation, motion blur and occlusion, and can be used to track the target in complex situations.

Classifier construction
The traditional kernel correlation filter tracking method generates a large number of samples by cyclic shifts of the sample in the target area and its neighbourhood in the current frame, and the training sample set is constructed from the sample features and their corresponding expected output values. The ith sample in the training sample set is expressed as (x_i, y_i), where x_i represents the HOG feature of the ith sample and y_i represents the expected output value corresponding to x_i. The classifier is constructed to find a function f(x_i) = w^T x_i that minimizes the squared error between the actual output value and the expected output value. That is:

$$\min_{w}\sum_{i}\left(w^{\mathrm{T}}x_{i}-y_{i}\right)^{2}+\lambda\left\|w\right\|^{2}\tag{1}$$

where λ is the regularization parameter. In most cases the training sample set is linearly inseparable, so the sample features can be mapped into a kernel space by the kernel correlation filter method; in this way, the linearly inseparable problem is resolved. Thus w can be represented as:

$$w=\sum_{i}\alpha_{i}\varphi(x_{i})\tag{2}$$

where φ(x_i) is the mapping that transforms the training sample x_i into the kernel space, and α is the classifier coefficient vector in the dual space as opposed to the primal-space w. Therefore, the classifier can be determined by solving for α. In order to reduce the calculation, the solution can be obtained by the discrete Fourier transform as [46]:

$$\hat{\alpha}=\frac{\hat{y}}{\hat{k}^{xx}+\lambda}\tag{3}$$

where a hat ˆ denotes the DFT of a vector, and k^{xx} is the first row vector of the kernel matrix.

Feature selection plays an important role in the design of the classifier. The HOG feature has good robustness to illumination variation, but it is sensitive to the deformation of the target. HOG is a gradient histogram feature: the image is segmented into cells, the gradient of each pixel in a cell is calculated, normalization and truncation are performed, and then the principal features are extracted by PCA to reduce the feature dimension.
Thus, 31-dimensional features can be obtained [49].
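The dual-space solution above can be sketched with NumPy for a single-channel feature map. This is only a minimal sketch, not the authors' implementation; the Gaussian kernel correlation follows the standard KCF formulation [46], and the `sigma` and `lam` values are assumptions.

```python
import numpy as np

def gaussian_kernel_correlation(x1, x2, sigma=0.5):
    """Gaussian kernel correlation of all cyclic shifts, computed via the FFT."""
    c = np.fft.ifft2(np.fft.fft2(x1) * np.conj(np.fft.fft2(x2))).real
    # squared distance between x1 and every cyclic shift of x2
    d = np.sum(x1 ** 2) + np.sum(x2 ** 2) - 2.0 * c
    return np.exp(-np.maximum(d, 0) / (sigma ** 2 * x1.size))

def train_kcf(x, y, lam=1e-4):
    """Closed-form dual solution: alpha_hat = y_hat / (k_hat^{xx} + lambda)."""
    k = gaussian_kernel_correlation(x, x)
    return np.fft.fft2(y) / (np.fft.fft2(k) + lam)
```

The division is elementwise in the Fourier domain, which is what makes the training cost of KCF nearly linear in the number of pixels.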
The magnitude and the angle of the gradient are calculated as:

$$G(x',y')=\sqrt{G_{x'}(x',y')^{2}+G_{y'}(x',y')^{2}}\tag{4}$$

$$\theta(x',y')=\arctan\left(\frac{G_{y'}(x',y')}{G_{x'}(x',y')}\right)\tag{5}$$

where G(x′,y′) represents the grey gradient magnitude of the pixel (x′,y′), θ(x′,y′) represents its grey gradient angle, and G_{x′}(x′,y′) and G_{y′}(x′,y′) represent the horizontal and vertical grey gradients of the pixel (x′,y′), respectively. The HOG feature only uses the grey value of the RGB colour space. In contrast, the hue feature of the HSV colour space and the CN feature have good invariance to target deformation. In order to express the colour feature of the target area precisely, a new HHG feature based on the hue channel of the HSV colour space, together with the CN feature, is used to provide supplemental colour information.
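The per-pixel gradient magnitude and angle can be computed with central differences, as is common for HOG-style features. A minimal NumPy sketch; the central-difference scheme and the use of `arctan2` (for a full-range angle) are assumptions, not the paper's specified discretization.

```python
import numpy as np

def gradient_mag_angle(img):
    """Per-pixel gradient magnitude and angle of a float-valued image."""
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]   # horizontal central difference
    gy[1:-1, :] = img[2:, :] - img[:-2, :]   # vertical central difference
    mag = np.sqrt(gx ** 2 + gy ** 2)
    ang = np.arctan2(gy, gx)                 # full-range angle in radians
    return mag, ang
```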
For the HHG feature, the hue gradient of every pixel is calculated in each cell, and a 31-dimensional feature vector is formed based on the hue angle instead of the grey gradient angle. Thus, the HHG feature of sample i can be expressed as u_i.
The HHG and HOG features of sample i are fused into a hybrid feature L_i by serial connection:

$$L_{i}=\left[x_{i};u_{i}\right]\tag{6}$$

Thus, based on the HOG-HHG feature, Equation (1) can be re-expressed as:

$$\min_{w}\sum_{i}\left(w^{\mathrm{T}}L_{i}-y_{i}\right)^{2}+\lambda\left\|w\right\|^{2}\tag{7}$$

When the target is partially occluded, the CN feature can still describe the colour information of the target well. Therefore, the CN feature is used as an auxiliary feature to train another classifier. Let C_i represent the CN feature of the ith sample; based on the CN feature, Equation (1) can be re-expressed as:

$$\min_{w}\sum_{i}\left(w^{\mathrm{T}}C_{i}-y_{i}\right)^{2}+\lambda\left\|w\right\|^{2}\tag{8}$$

Based on Equations (7) and (8), the two classifiers can be expressed separately as:

$$\hat{\alpha}_{HOG\text{-}HHG}=\frac{\hat{y}}{\hat{k}^{LL}+\lambda}\tag{9}$$

$$\hat{\alpha}_{CN}=\frac{\hat{y}}{\hat{k}^{CC}+\lambda}\tag{10}$$

where α̂_HOG-HHG is the classifier based on the HOG-HHG feature, and α̂_CN is the classifier based on the CN feature.
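The serial connection of the two cell-feature maps amounts to a channel-wise concatenation. A minimal sketch, assuming 31-dimensional HOG and HHG cell features arranged as (H, W, 31) arrays:

```python
import numpy as np

def serial_connect(hog, hhg):
    """Stack HOG and HHG cell features along the channel axis: (H, W, 62)."""
    return np.concatenate([hog, hhg], axis=-1)
```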

Target detection
Based on the classifiers in Equations (9) and (10), the two response maps of the candidate area can be obtained as:

$$f_{HOG\text{-}HHG}(z)=F^{-1}\left(\hat{k}^{xz}_{HOG\text{-}HHG}\odot\hat{\alpha}_{HOG\text{-}HHG}\right)\tag{11}$$

$$f_{CN}(z)=F^{-1}\left(\hat{k}^{xz}_{CN}\odot\hat{\alpha}_{CN}\right)\tag{12}$$

where f_HOG-HHG and f_CN represent the response maps of the HOG-HHG feature and the CN feature, respectively; k^{xz}_{HOG-HHG} represents the kernel correlation between the target sample x and the candidate-area sample z based on the HOG-HHG feature, k^{xz}_{CN} the corresponding kernel correlation based on the CN feature, and F^{-1} the inverse discrete Fourier transform.
In the practical tracking process, the foreground and the background change continuously, so it is difficult to determine the target position accurately from a single feature response map. Therefore, the HOG-HHG and CN feature response maps are adaptively fused to locate the target. The fusion response map is:

$$f(z)=a\,f_{HOG\text{-}HHG}(z)+b\,f_{CN}(z)\tag{13}$$

where f(z) is the fusion response map. The position of the maximum response value in the fusion response map is taken as the target position. The adaptive fusion weights are set as:

$$a=\frac{\max\left(f_{HOG\text{-}HHG}(z)\right)}{\max\left(f_{HOG\text{-}HHG}(z)\right)+\max\left(f_{CN}(z)\right)}\tag{14}$$

$$b=\frac{\max\left(f_{CN}(z)\right)}{\max\left(f_{HOG\text{-}HHG}(z)\right)+\max\left(f_{CN}(z)\right)}\tag{15}$$

where a and b are the weights of the HOG-HHG and CN feature response maps, respectively, and max(f_HOG-HHG(z)) and max(f_CN(z)) are the maximum values of the two response maps. The adaptive weights in Equations (14) and (15) strengthen the role of the feature response map with the larger maximum response value, so the more reliable feature plays the more important role in locating the target.
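The adaptive fusion described by Equations (14) and (15) can be sketched as follows; the function and variable names are illustrative only.

```python
import numpy as np

def fuse_response_maps(f_hoghhg, f_cn):
    """Fuse two response maps with weights proportional to their peak values."""
    m1 = f_hoghhg.max()
    m2 = f_cn.max()
    a = m1 / (m1 + m2)          # weight of the HOG-HHG response map
    b = m2 / (m1 + m2)          # weight of the CN response map
    fused = a * f_hoghhg + b * f_cn
    # target position = location of the maximum fused response
    pos = np.unravel_index(np.argmax(fused), fused.shape)
    return fused, pos, (a, b)
```

Because a + b = 1, the fusion is a convex combination: whichever feature produced the more confident (higher-peaked) response dominates the localization.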
The panda sequence is taken as an example. Figure 1 shows the weights change of the HOG-HHG feature response map and the CN feature response map in the tracking process. Figure 2 shows the tracking results of three tracking strategies based on HOG-HHG feature response map, the fixed weight fusion feature response map and the adaptive fusion feature response map. In the tracking strategy based on the fixed weight fusion feature response map, the weights are set as: a = 0.3 and b = 0.7.
In Figure 2, the red boxes, the blue boxes and the green boxes are the tracking results obtained by the tracking strategies based on the HOG-HHG feature response map, the fixed-weight fusion feature response map and the adaptive fusion feature response map, respectively. The panda sequence has attributes such as occlusion and low resolution. From the tracking results, only the tracking strategy based on the adaptive fusion feature response map completes the target tracking. Therefore, the adaptive fusion feature response map can improve the tracking performance significantly.

Model updating
In order to adapt to illumination variation, motion blur and occlusion, both the target appearance model and the classifier coefficient vector need to be updated in the tracking process. Generally, the update strategy can be described as:

$$\hat{x}_{n}=(1-\gamma)\hat{x}_{n-1}+\gamma\hat{x}'_{n}\tag{16}$$

$$\hat{\alpha}_{n}=(1-\gamma)\hat{\alpha}_{n-1}+\gamma\hat{\alpha}'_{n}\tag{17}$$

where γ is the learning rate, x̂_n and α̂_n are the target appearance model and the classifier coefficient vector in the current frame, x̂_{n-1} and α̂_{n-1} are those of the previous frame, and x̂′_n and α̂′_n are computed from the newly detected target area in frame n.
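The update itself is a linear interpolation between the previous model and the newly detected one; a one-line sketch (the default γ = 0.075 matches the model update parameter used in the experiments):

```python
def update_model(prev, new, gamma=0.075):
    """Linear-interpolation update: (1 - gamma) * previous + gamma * new."""
    return (1.0 - gamma) * prev + gamma * new
```

The same function applies unchanged to the appearance model and to the classifier coefficient vector, since both follow the same interpolation rule.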
In the traditional KCF tracking method, the target model is updated from the new target area every time after the target is detected. However, when the target is occluded or the illumination varies, this updating operation makes the target model unstable and can even cause tracking failure. Therefore, the target model should only be updated when the target is undisturbed. For this, a new model updating criterion is designed to enhance the tracking performance.
Generally, the peak sidelobe ratio (PSR) can be used as a valid model updating criterion. When PSR is bigger than a set threshold value, the target area is not disturbed and the model should be updated; otherwise, when PSR is smaller than the threshold, the target area may be disturbed and the model updating should be stopped. However, PSR alone cannot completely describe the variation process of the target model, while the difference of PSR indicates its variation tendency. Therefore, both PSR and its difference ΔPSR are used as the new model updating criteria, and dynamic threshold values are used to enhance the adaptability of the tracking method.
PSR is defined as:

$$PSR=\frac{p-u}{\sigma}\tag{18}$$

where p is the peak value of the response map, u is the mean of the sidelobe area, and σ is the standard deviation of the sidelobe area. The difference of PSR is calculated as:

$$\Delta PSR(n)=PSR(n)-PSR(n-1)\tag{19}$$

where PSR(n) is the PSR of frame n. ΔPSR can be used to measure the interference process of the target area. Thus, the target model is updated only when both PSR and ΔPSR are bigger than their threshold values. Considering the complexity of the target model variation, adaptive threshold values of PSR and ΔPSR are chosen to adapt to the various cases.
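PSR and ΔPSR can be sketched as follows. The size of the window excluded around the peak when computing the sidelobe statistics is not specified in the text, so the `exclude` parameter here is an assumption.

```python
import numpy as np

def psr(response, exclude=2):
    """Peak sidelobe ratio: (peak - sidelobe mean) / sidelobe std.

    The sidelobe is the response map with a small window around the
    peak masked out; the window half-width `exclude` is an assumption.
    """
    peak_idx = np.unravel_index(np.argmax(response), response.shape)
    p = response[peak_idx]
    mask = np.ones_like(response, dtype=bool)
    rows = slice(max(peak_idx[0] - exclude, 0), peak_idx[0] + exclude + 1)
    cols = slice(max(peak_idx[1] - exclude, 0), peak_idx[1] + exclude + 1)
    mask[rows, cols] = False
    sidelobe = response[mask]
    return (p - sidelobe.mean()) / (sidelobe.std() + 1e-12)

def delta_psr(psr_curr, psr_prev):
    """Difference of PSR between consecutive frames."""
    return psr_curr - psr_prev
```

A sharp, isolated peak yields a high PSR (confident detection); occlusion or blur flattens the response map and drives both PSR and ΔPSR down.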
The PSR threshold value V_PSR is set as a decreasing function of M_1 (Equation (20)), where M_1 is the average value of the PSR over the last three frames.
The threshold value curve in Equation (20) is shown in Figure 3.
When the target is disturbed continuously, M_1 decreases. From Figure 3, the threshold value V_PSR then increases to avoid updating the model, which improves the stability of the target model. Furthermore, when the target is disturbed more severely, the PSR value decreases continuously and ΔPSR becomes negative. For this, the ΔPSR threshold value V_ΔPSR is set as a function of M_2, where M_2 is the average value of ΔPSR over the last three frames. The threshold value curve V_ΔPSR in Equation (22) is shown in Figure 4.
When the target is continuously disturbed, for instance occluded, M_2 will gradually decrease and become negative, and the target model should not be updated. From Figure 4, a smaller M_2 corresponds to a larger threshold. When M_2 is positive, the PSR value will increase; in this case, whether the target model needs to be updated depends on both PSR and ΔPSR. Figure 5 shows the changes of PSR, ΔPSR and their thresholds in the tracking process of the bird2 sequence.
From Figure 5, PSR is relatively high from frame 20 to frame 40, which means that the predicted target position is highly confident in these frames. Meanwhile, ΔPSR varies around zero over the same frames, which means that the target is not continuously disturbed. Therefore, the model should be updated in these frames. Figure 6 shows the tracking results of the bird2 sequence. From Figure 6, the appearance of the target changes obviously at frame 47, and PSR continues to decrease in the following frames. In order to keep the stability of the target model, both PSR and ΔPSR then fall below their thresholds and the model updating is stopped.

The overall process of tracking

Figure 7 shows the tracking block diagram of the proposed method. HOG-HHG and CN features of the target area are extracted from the initial frame, and cosine windows are applied to the two feature maps to suppress the boundary effect. In addition, in order to reduce the computation, the two classifiers are trained separately in the Fourier frequency domain, and the feature response maps corresponding to the two features are obtained from the trained classifiers. The value of each pixel in a feature response map represents the probability that the pixel belongs to the target region, so the position of the maximum value of the fused response map is taken as the target position. When the model updating criterion is met, the target model is updated.
The pseudocode of the tracking method can be described as:
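Since the pseudocode itself is not reproduced here, the per-frame loop summarized above can be sketched as follows; all step names are descriptive paraphrases of the text, not the authors' exact pseudocode.

```
Input:  initial target position in frame 1; frames 1..N
Output: target position in every frame

extract HOG-HHG and CN features of the target area; apply cosine windows
train the two classifiers alpha_HOG-HHG and alpha_CN in the Fourier domain
for n = 2 .. N:
    extract the two features of the candidate area (twice the target size)
    compute the response maps f_HOG-HHG(z) and f_CN(z)
    compute the adaptive weights a, b from the two maximum responses
    fuse the maps: f(z) = a * f_HOG-HHG(z) + b * f_CN(z)
    target position <- position of the maximum of f(z)
    compute PSR and dPSR and their adaptive thresholds V_PSR, V_dPSR
    if PSR > V_PSR and dPSR > V_dPSR:
        update the appearance model and classifier coefficients
        with learning rate gamma
```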

SIMULATION EXPERIMENTS
In this paper, the size of the candidate region is twice that of the target area in the previous frame, the regularization parameter is set as 10^-4, the model update parameter is set as 0.075, and the cell size is set as 4×4. The experiments are executed on a computer with an Intel Core i7-10750H CPU and an NVIDIA GTX 1650 GPU. The performance of the tracking algorithm is tested for three attributes: occlusion, motion blur and illumination variation. Eight tracking methods, ECO [8], DeepSRDCF [7], UDT [12], CSK [39], DSST [40], KCF [46], Staple [48] and SRDCF [50], are chosen as the comparison methods. Tracking performances are tested on the Benchmark test platform.
The tracking precision is expressed by the centre location error (CLE), which is defined as:

$$CLE=\sqrt{\left(x_{r}-x'_{r}\right)^{2}+\left(y_{r}-y'_{r}\right)^{2}}$$

where (x_r, y_r) is the central position of the target area predicted by the tracker, and (x′_r, y′_r) is the ground truth.
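The CLE is simply the Euclidean distance between the predicted and ground-truth centres; a minimal sketch:

```python
import math

def centre_location_error(pred, gt):
    """Euclidean distance between the predicted centre and the ground truth."""
    return math.hypot(pred[0] - gt[0], pred[1] - gt[1])
```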

Tracking test results
Three sequences with different attributes, such as illumination variation, motion blur, and occlusion, are selected to test the tracking performance of the proposed method. The tracking results of the different methods are shown by boxes of different colours. Black, green, dark blue, pink, light blue, red, grey, dark red, and orange boxes represent the tracking results obtained by CSK, DSST, KCF, Staple, SRDCF, the proposed method, UDT, ECO, and DeepSRDCF, respectively.

Occlusion
The target in the Jogging sequence is occluded. Figure 8 shows the tracking results of the Jogging sequence, and Figure 9 shows its centre location errors.
From Figures 8 and 9, the target is not occluded from frame 30 to frame 60, so all nine methods can track the target accurately. However, after frame 60 the target enters the occlusion area, and the centre location errors of some comparison methods grow larger and larger. In contrast, when the target is occluded, both PSR and ΔPSR fall below their thresholds, and the model updating is stopped to avoid target model drift. Therefore, the proposed method has smaller centre location errors over the whole tracking process, and the tracking performance is improved. From Figure 9, in most frames the centre location errors of the proposed method are less than 20 pixels.

Illumination variation
The Shaking sequence, which has a significant illumination variation attribute, is chosen as the test sequence. Figure 10 shows the tracking results of the Shaking sequence obtained by the different tracking methods, and Figure 11 shows the corresponding centre location errors.
From Figures 10 and 11, the illumination varies significantly from frame 30 to frame 100. The proposed method obtains more accurate tracking results than the other methods, because the adaptive fusion response map based on the HOG-HHG and CN response maps improves the tracking robustness to illumination variation. From Figure 11, its average centre location error is the smallest among all the methods, so the proposed method has the better illumination invariance.

Motion blur

The target in the Blurowl sequence is blurred. Figure 12 shows the tracking results obtained by the different tracking methods, and Figure 13 shows the centre location errors of the Blurowl sequence.
From Figures 12 and 13, the ECO, DeepSRDCF, UDT and Staple methods and the proposed method track the target successfully from frame 170 to frame 220, while the other methods fail because of their poor adaptability to complex situations. Compared with the proposed method, the Staple method cannot continuously adapt to the blurred scenes because of its fixed fusion weights, so its centre location errors are larger than those of the proposed method. The proposed method obtains tracking results similar to those of the methods based on deep learning.

Tracking performance analysis

27 sequences, each with at least one of the three attributes (occlusion, illumination variation and motion blur), are selected as test sequences, as shown in Table 1. There are seven test sequences with illumination variation, 14 with occlusion, and 12 with motion blur. Precision plots and success plots of one-pass evaluation are used to evaluate the tracking performance of the methods above. Figure 14 shows the precision plots of the trackers for the different attributes, and Figure 15 shows the corresponding success plots.
From Figures 14 and 15, the proposed method shows obviously higher precision and success rate than the traditional tracking methods. Because the tracking methods based on deep learning have advantages in target recognition, ECO and DeepSRDCF achieve slightly better tracking performance than the proposed method. However, the tracking performance of UDT, which is also based on deep learning, and that of the proposed method are about the same. Figure 16 shows the precision plots and success plots over all 27 test sequences.
From Figure 16, the average tracking performance of the proposed method is superior to that of UDT and the other traditional tracking methods. The average precision and the average success rate of the proposed method are 82.8% and 80.7%, respectively; its average performance is only slightly lower than that of ECO and DeepSRDCF. It can thus be seen that the adaptive fusion response map can accurately determine the target position, and that the model updating strategy improves the stability of the target model in complex tracking scenes.
Though the tracking methods based on deep learning have higher tracking precision, a large computation cost is inevitable. For example, the average tracking speed of ECO is 8 FPS [8] and that of SRDCF is about 4 FPS [50]; the average tracking speed of DeepSRDCF is lower than that of SRDCF because of the deep feature extraction. UDT, as an unsupervised deep learning method, can meet real-time requirements, but its average tracking speed is only about 70 FPS [12]. In contrast, the traditional tracking methods have advantages in tracking speed. For instance, the average tracking speed of KCF is 172 FPS [46], the highest among the comparison methods. Although the feature dimension is increased in the proposed method, its average speed is still about 131 FPS on our computation platform. Therefore, the proposed method achieves both high tracking precision and high tracking speed.

CONCLUSION
The kernel correlation filter tracking method is susceptible to illumination variation, motion blur, and occlusion in the tracking process. To address this, an improved kernel correlation filter tracking method is proposed. Two response maps based on the HOG-HHG and CN features are adaptively fused to locate the target position. Furthermore, a new model update strategy based on PSR and ΔPSR is designed to improve the stability of the target model. The experimental results show that the proposed method has better adaptability to occlusion, illumination variation and motion blur, and that the tracking precision and success rate are improved significantly.

ACKNOWLEDGMENTS
This work is supported by Tianjin natural science foundation (18JCYBJC88300, 18JCYBJC88400), National natural science foundation of China (61873186) and Tianjin high school innovation team training program (TD13-5036).