Visual tracking using interactive factorial hidden Markov models

The authors present a novel tracking algorithm based on a factorial hidden Markov model (FHMM) that can utilise the structured information of a target. An FHMM consists of multiple hidden Markov models (HMMs), wherein each HMM aims to represent a different part of the target. Then, the geometric relation between patches is encoded in the FHMM framework via either interactive sampling or importance sampling over sets. Experimental results demonstrate that the proposed method qualitatively and quantitatively outperforms other methods, especially when the targets are highly deformable.


| INTRODUCTION
Visual tracking is widely used in various computer vision systems such as autonomous driving, surveillance, and robotics. Visual tracking methods can be categorised into two groups: generative and discriminative model-based visual trackers. Generative model-based visual trackers estimate posterior probabilities of target objects given observations from an initial frame to the current frame, in which the target objects are typically described by models based on sparse representation [1,2]. Then, they find the target region in upcoming frames using the posterior probabilities. Discriminative model-based visual trackers distinguish foreground regions from background regions by training discriminative classifiers. Then, these trained classifiers find object regions in upcoming frames via binary classification [3][4][5][6]. In particular, correlation filters in the Fourier frequency domain have shown great success in visual tracking because of their computational efficiency [7].
Recently, owing to the high representation ability of deep features [8], deep-learning-based visual trackers [9] have demonstrated considerably improved visual tracking accuracy and shown promising results in real-world environments. However, it remains difficult for them to track highly non-rigid targets, where the geometric structures of the targets can significantly vary over time.
This difficulty fundamentally originates from an inaccurate representation of the targets. For example, a target is typically described by a single bounding box, which is represented by a centre position (x, y) and a scale s. Then, the inference process is usually formulated using a single hidden Markov model (HMM) with state X_t = (x, y, s) and observation O_t for t = 1, …, T, as shown in Figure 1(a). However, this bounding-box representation has only three degrees of freedom. Thus, it cannot take into account notable appearance changes caused by arbitrary deformation.
To handle highly deformable targets, the authors propose to represent the target as a combination of multiple patches with 3 × K degrees of freedom, wherein the state of each patch is inferred by a different HMM, as shown in Figure 1(b). To encode geometric relations among patches, all HMMs interact with each other using the proposed interaction strategies.
The contribution of this method is threefold:

• The authors propose a novel visual tracking system based on an FHMM that can efficiently infer the state of each part of the target. The proposed FHMM can accurately describe deformable objects using multiple HMMs, in which each HMM deals with one part of the deformable object. In contrast, a conventional HMM fails to accurately capture deformation of the objects, because a single HMM can only describe an object as a whole and cannot exploit geometric relations between object parts.

• Two interaction strategies among multiple HMMs are presented, namely inference methods via interactive sampling (IS) [10] or importance sampling over sets (ISS) [11]. Using these strategies, HMMs can communicate with each other and share information on the accurate states of the target. In addition, the multiple Markov chains in FHMMs prevent the proposed visual tracker from becoming trapped in local optima and help it to find a globally optimal state. In contrast, a conventional HMM has no mechanism for interaction with other Markov chains, and thus it cannot enable visual tracking methods to escape from local optima. Conventional methods [10,11] apply IS or ISS to Markov chain Monte Carlo, whereas the proposed method adopts IS or ISS for FHMMs.

• It is shown that the proposed method considerably outperforms non-deep-learning-based trackers. In addition, although the proposed visual trackers do not employ deep neural architectures, the proposed FHMM-ISS is comparable with state-of-the-art visual trackers.
The differences compared with relevant visual tracking algorithms are discussed in Section 2. The proposed FHMM is explained in Section 3, while two interaction strategies, namely IS and ISS, are introduced in Sections 4 and 5, respectively. In Section 6, the proposed visual trackers are numerically compared with state-of-the-art visual tracking methods. The conclusion is provided in Section 7.

| RELATED WORK
In this section, visual tracking methods closely related to the proposed method are introduced (i.e. visual tracking methods based on an HMM in Section 2.1, visual trackers for deformable objects in Section 2.2, visual tracking methods using deep features in Section 2.3, and generic visual tracking methods in Section 2.4).

| HMM-based visual trackers
Wang et al. [12] proposed an FHMM to fuse the results of multiple visual trackers. Yuan et al. [13] used a standard HMM to aggregate multiple visual trackers. In contrast to these methods, the proposed method uses multiple HMMs to track each part of the target. Park et al. [14] introduced an autoregressive HMM to encode the relations among varying appearances of the target over time. Ha and Kwon [15] proposed a hidden semi-Markov model (HSMM)-based visual tracker to enhance the motion model via trajectory simulation. However, these methods use a single HMM for visual tracking. In contrast to these methods, the proposed method employs multiple HMMs and has an additional mechanism that enables all HMMs to interact with each other.
Mei and Porikli [16] also presented a visual tracking method based on FHMMs. However, their FHMM consists of only two HMMs, for tracking and registration respectively, and these two HMMs interact only with each other. Thus, it is difficult to propagate a good state of one HMM to the others. In contrast, the authors' FHMM consists of five HMMs, which fully interact using explicit interaction techniques, IS and ISS. In addition, each HMM is modelled to describe a single part of an object.

| Visual trackers for deformable objects
Sun et al. [17] presented a novel kernelised correlation filter that can preserve shape information for deformable objects. Lukezic et al. [18] proposed a correlation filter-based visual tracker that adopts the constellation model to describe each part of a target. Zhao et al. [19] adopted a Gaussian mixture model to handle partially occluded objects and introduced a deformable part model. Du et al. [20] considered dependences among parts of objects via a graph and accurately tracked deformable objects. Gao et al. [21] predicted a target location from part locations using a deep regression model. Kwon and Lee [22] proposed a Bayesian visual tracking method for highly non-rigid objects, which describes a target as multiple parts and exploits the relations between pairs of parts. Zhaowei et al. [23] constructed a dynamic graph and matched the estimated graph with the target graph. However, these methods have no explicit interaction mechanism.
In contrast to these methods, the proposed method focuses more on the interaction among the parts of objects than conventional methods and formulates two interacting algorithms in the FHMM framework.

FIGURE 1: Conceptual difference between a conventional HMM and the proposed FHMM in the inference of target configurations for visual tracking: (a) conventional methods infer a target configuration (x, y, s) using a single HMM; (b) the proposed method infers K target configurations, (x_k, y_k, s_k) for k = 1, …, K, using the proposed FHMM, which consists of K HMMs. In the proposed method, each HMM aims to track a part of the target, while all HMMs communicate with each other via two interaction processes. FHMM, factorial hidden Markov model; HMM, hidden Markov model.

| Visual trackers using deep features
Deep neural networks [24][25][26] have enabled conventional visual tracking methods to extract more representative features, significantly improving the accuracy of visual trackers, as demonstrated by [27][28][29][30][31]. Ma et al. [27] presented hierarchical convolutional features to represent a target at multiple levels of abstraction and accurately tracked the target. Nam et al. [28] transformed visual tracking problems into binary classification ones and improved visual tracking accuracy using multi-domain features. Wang et al. [29] introduced a sequential training strategy for online visual tracking. Wang and Yeung [30] presented a stacked denoising autoencoder to extract deep features and classified each image patch into foreground or background. Kwon et al. [31] integrated deep neural networks into Monte Carlo sampling to infer multiple variables simultaneously. However, these methods did not explicitly deal with highly deformable objects.
By contrast, the proposed method adopts deep features in a novel FHMM to deal with deformable objects.

| Generic visual tracking methods
Isard and Blake [32] introduced a Bayesian framework based on particle filtering for visual tracking, which approximated non-Gaussian distributions using particles. Khan et al. [33] presented a Markov chain Monte Carlo-based visual tracking, where high-dimensional posterior distributions were efficiently estimated. Kwon et al. [34] proposed an uncertainty calibrated MCMC, in which the uncertainty in the Monte Carlo sampling was minimised for accurate inference of posterior distributions. Gustafsson et al. [35] presented a deep probabilistic regression method based on Monte Carlo sampling to approximate the likelihood distribution and optimise an energy model for visual tracking.
In contrast to these methods, the proposed method accurately estimates a target posterior distribution using multiple Markov chains, which also prevent the visual tracker from becoming trapped in local optima.

| VISUAL TRACKING BY FHMM
In an HMM, a target state X is represented as a sequence of states of length T, that is X = {X_1, …, X_T}, while an observation O is expressed as a set of image features within bounding boxes across T frames, that is O = {O_1, …, O_T}. Thus, the number of states is equivalent to the number of observations. Then, the joint distribution for the HMM is formulated as

p(X, O) = p(X_1) p(O_1 | X_1) ∏_{t=2}^{T} p(O_t | X_t) p(X_t | X_{t−1}),

where p(O_t | X_t) and p(X_t | X_{t−1}) denote the likelihood and transition probability of the HMM at time t, respectively.
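As a minimal illustration of this factorisation, the joint distribution can be evaluated in log space. The sketch below uses hypothetical univariate Gaussian densities standing in for the paper's likelihood and transition models; it is not the authors' implementation.

```python
import math

def log_gauss(x, mean, var):
    # Log-density of a univariate Gaussian; stands in for both the
    # likelihood p(O_t | X_t) and the transition p(X_t | X_{t-1}).
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def hmm_log_joint(states, observations, var_trans=1.0, var_obs=0.5):
    # log p(X, O) = sum_t log p(O_t | X_t) + sum_{t>1} log p(X_t | X_{t-1}),
    # assuming a flat prior on the initial state X_1.
    assert len(states) == len(observations)  # one observation per state
    logp = 0.0
    for t in range(len(states)):
        if t > 0:
            logp += log_gauss(states[t], states[t - 1], var_trans)
        logp += log_gauss(observations[t], states[t], var_obs)
    return logp
```

A state sequence that follows the observations smoothly scores a higher joint probability than one that jumps away from them, which is exactly the property the tracker's inference exploits.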

Algorithm 1 FHMM-IS
Input: X̂(k) for k = 1, …, K at the previous frame.
Output: X̂(k) for k = 1, …, K at the current frame.
1: for k = 1 to K (for each part) do
2:   for t = 1 to T do
3:     Move to the next state using p(X_t(k) | X_{t−1}(k)) …

Deep features are extracted from the 14th feature map of the VGG-19 network [8] as the observed features; the network was pre-trained on the ImageNet dataset [37]. The correlation filter used in [9] is adopted as the distance metric. Then, the transition probability is designed as a Gaussian. In all experiments, Σ is fixed to the diagonal matrix with diagonal entries (σ_x, σ_y, σ_s), where σ_x, σ_y, and σ_s denote the variances for the x, y, and scale coordinates, respectively. Note that σ_x, σ_y, and σ_s are hyperparameters, which are set to 0.01, 0.01, and 0.1, respectively.
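The Gaussian transition with a diagonal Σ described above can be sketched as follows. This is a hypothetical helper, not the authors' code; it uses the stated variances σ_x = σ_y = 0.01 and σ_s = 0.1.

```python
import random

SIGMA = (0.01, 0.01, 0.1)  # variances for x, y, and scale, as in the text

def propose_next_state(state, sigma=SIGMA):
    # Draw X_t ~ N(X_{t-1}, Sigma). Because Sigma is diagonal, each of
    # (x, y, s) is perturbed independently with its own variance.
    x, y, s = state
    return (x + random.gauss(0.0, sigma[0] ** 0.5),
            y + random.gauss(0.0, sigma[1] ** 0.5),
            s + random.gauss(0.0, sigma[2] ** 0.5))
```

Note that `random.gauss` takes a standard deviation, hence the square root of each variance.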
The goal of an FHMM is to find the best state X̂(k) for each HMM (i.e. for each part of the target) that maximises the posterior probability for k = 1, …, K and t = 1, …, T.
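The per-part maximisation can be sketched as below. This is a hypothetical helper (not the paper's inference code): it assumes sampled states and their posterior weights are already available for each of the K HMMs, and simply returns the highest-weight sample per part.

```python
def best_states(samples, weights):
    # samples[k][t] and weights[k][t] hold the t-th sampled state of the
    # k-th HMM and its posterior weight; select the best sample per part.
    estimates = []
    for k in range(len(samples)):
        t_best = max(range(len(samples[k])), key=lambda t: weights[k][t])
        estimates.append(samples[k][t_best])
    return estimates
```

Each part is maximised independently here; the interaction strategies (IS and ISS) are what couple the K chains during sampling.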

Algorithm 2 FHMM-ISS
Input: X̂(k) for k = 1, …, K at the previous frame.
Output: X̂(k) for k = 1, …, K at the current frame.
1: for k = 1 to K (for each part) do
2:   for t = 1 to T do
3:     Move to the next state using p(X_t(k) | X_{t−1}(k)) …

| INFERENCE VIA IS

Here, p(O_t(k) | X_t(k)) is the likelihood of the k-th HMM at time t. Note that the state of the k-th HMM cannot accurately describe the states of the other parts, although all parts of a single target are located close to each other. However, the state of the k-th HMM prevents the states of the other parts from drifting into the background. The proposed IS method executes the interaction mode with probability γ = 1 − 0.5t/T, where t and T denote the iteration index and the total number of iterations, respectively. The value of γ linearly decreases as the sampling proceeds. Figure 2 conceptually illustrates the IS method for the interaction among multiple HMMs. Algorithm 1 shows the whole process of the proposed visual tracking method with IS (i.e. FHMM-IS).
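The interaction schedule γ = 1 − 0.5t/T and one possible interaction step can be sketched as follows. This is a simplified toy version: the rule of jumping to the most likely chain's state is a hypothetical stand-in for the IS move in [10], used only to show how the schedule gates the interaction mode.

```python
import random

def interaction_prob(t, T):
    # gamma = 1 - 0.5 * t / T decays linearly from 1.0 towards 0.5,
    # so chains interact frequently early on and settle down later.
    return 1.0 - 0.5 * t / T

def is_step(states, likelihoods, t, T, rng=random.random):
    # One toy interactive-sampling step: with probability gamma, each
    # chain may jump to the state of the most likely chain, propagating
    # a good state across the K HMMs.
    gamma = interaction_prob(t, T)
    best = max(range(len(states)), key=lambda k: likelihoods[k])
    return [states[best] if rng() < gamma else states[k]
            for k in range(len(states))]
```

Early in the sampling (t near 0), γ ≈ 1 and good states spread quickly; later, γ approaches 0.5 and each chain explores more independently.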

| INFERENCE VIA ISS
The proposed ISS for the FHMM is a variant of the one reported in [11]. Using ISS, each HMM calibrates its likelihood value using the partition function Z_t. With this modification, each HMM is affected by the other HMMs. For this, ISS defines a set of states and estimates a statistic of the set (i.e. Z_t), which encodes the likelihoods of all HMMs at time t.

FIGURE 2: Interaction among multiple hidden Markov models using interactive sampling

To estimate Z_t in (7), a set-proposal function q is designed that proposes a set of new states based on a set of previous states. Then, the transition probability in (4) can be substituted by q, where G is a multivariate Gaussian function and N is defined in (4). The probability β(X_t) of X_t being included in the set {X_t(k)} sampled by q can be computed using the function I, which outputs unity if X_t ∈ {X_t(k)}. Using β(X_t), the unbiased partition function Z_t in (7) can be estimated; the entire proof can be found in [11]. Figure 3 conceptually illustrates the ISS method for the interaction among multiple HMMs. Algorithm 2 shows the whole process of the proposed visual tracking method with ISS (i.e. FHMM-ISS).
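A toy sketch of the importance-sampling estimate of Z_t follows. The likelihood and the inclusion probability β here are hypothetical stand-ins for the quantities defined in [11]; the point is only the shape of the estimator (likelihood weighted by the inverse inclusion probability, averaged over the sampled set).

```python
def estimate_partition(sampled_states, likelihood, beta):
    # Importance-sampling estimate of Z_t: average each sampled state's
    # likelihood, weighted by 1 / beta(x), where beta(x) is the
    # probability that x was included in the proposed set {X_t^(k)}.
    total = 0.0
    for x in sampled_states:
        total += likelihood(x) / beta(x)
    return total / len(sampled_states)
```

As a sanity check, when every state is included with probability one (β ≡ 1) and the likelihood is constant, the estimate equals that constant.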

| EXPERIMENTS
The authors used five parts of the target (i.e. K = 5) and 100 samples for each part (i.e. T = 100), and randomly initialised the configuration of each part. The proposed methods were compared with state-of-the-art deep-learning-based methods, namely SiamRPN++ [38], TADT [39], DAT [40], SiamDW [41], SINT [42], SINT-op [42], C-COT [43], ECO [44], and ECO-HC [44]. These trackers were evaluated on 50 test sequences in the OTB dataset [45]. The OTB data set is a standard visual tracking benchmark and contains either 50 (OTB-50) or 100 (OTB-100) challenging sequences with several different attributes: out-of-view objects, low resolution, background clutter, out-of-plane rotation, motion blur, fast motion, deformation, occlusion, in-plane rotation, scale variation, and varying illumination.

Visual trackers were evaluated using precision plots, success plots, and the AUC [45]. The precision plot shows the percentage of frames in which the distance between the bounding box estimate and the ground truth is less than a threshold. The success plot shows the percentage of frames in which the intersection over union between the bounding box estimate and the ground truth is greater than a threshold. The AUC is the area under the success plot. All experiments were conducted using an Intel i7 3.60 GHz CPU and a GeForce RTX 2080 GPU.

Figure 4 qualitatively compares the proposed methods using the OTB-50 dataset. Blue, red, yellow, green, and white boxes show the states of the five parts of the target. As shown in the figure, FHMM-ISS accurately localised all parts via the interaction process, that is ISS. The method was also analysed in a component-wise manner using the OTB-50 data set, and the sensitivity to several hyperparameters was verified. For this analysis, the proposed method was not compared with other methods; instead, three variants of the proposed method were evaluated, namely FHMM, FHMM-IS, and FHMM-ISS.
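The three OTB metrics described above can be sketched as follows. This is a simplified illustration assuming per-frame centre-location errors and IoU values have already been computed; the official OTB toolkit handles the full protocol.

```python
def precision_at(center_errors, threshold=20.0):
    # Fraction of frames whose centre-location error (in pixels) is
    # within the threshold; 20 px is the conventional OTB operating point.
    return sum(e <= threshold for e in center_errors) / len(center_errors)

def success_at(ious, threshold=0.5):
    # Fraction of frames whose IoU with the ground truth exceeds the threshold.
    return sum(i >= threshold for i in ious) / len(ious)

def auc(ious, steps=101):
    # Area under the success plot: average success rate over IoU
    # thresholds sampled uniformly in [0, 1].
    thresholds = [k / (steps - 1) for k in range(steps)]
    return sum(success_at(ious, th) for th in thresholds) / steps
```

Sweeping the threshold rather than fixing it is what makes the AUC robust to the choice of a single operating point.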
FHMM denotes the baseline tracker, which employs the proposed factorial HMM as an inference method. FHMM-IS and FHMM-ISS denote improved versions of the baseline tracker, combined with interactive sampling and importance sampling over sets, respectively. As shown in Figure 5, FHMM-IS and FHMM-ISS outperform FHMM, thereby demonstrating the effectiveness of the proposed interaction strategies for multiple HMMs. The interaction method based on ISS was the best in terms of both precision and success plots.

FIGURE 3: Interaction among multiple hidden Markov models using importance sampling over sets

Figure 6 demonstrates the effectiveness of the proposed interaction strategy, ISS. At frame #99, the white-coloured FHMM failed to track a certain part of the target and drifted into the background. However, the proposed method prevented the white-coloured FHMM from drifting further and recovered its state with the help of the other FHMMs using the proposed interaction mechanism, ISS.

Table 1 shows the tracking accuracy of the proposed method in terms of AUC as the number of parts (i.e. FHMMs) increases. If a single part or FHMM is used, the tracking accuracy drops considerably, which demonstrates the effectiveness of using multiple FHMMs. However, using too many parts or FHMMs also deteriorates the visual tracking performance in terms of speed and accuracy. Nevertheless, the proposed FHMM-ISS was not significantly sensitive to the number of parts.

Table 2 shows the tracking accuracy of the proposed method in terms of AUC as the number of samples (i.e. the number of states in the FHMMs) increases. As shown in Table 2, similar results were obtained even with a different number of samples, which implies that the effectiveness of the proposed method mainly stems from the multiple FHMMs and the novel interaction mechanisms.

Table 3 shows the speed of the method in terms of frames per second. The proposed FHMM-ISS performed in real time, although the authors used 700 samples in total; the proposed interaction strategies and inference processes of FHMMs do not require high computational costs. Table 4 evaluates the computational costs of recent state-of-the-art visual trackers using the OTB data set. The proposed visual tracker runs in real time and is the fastest algorithm, because it does not adopt complex deep neural network architectures.

| Quantitative comparison
The authors quantitatively compared the proposed method with non-deep-learning-based trackers using the OTB-50 data set, as shown in Figure 7. This method considerably outperformed conventional visual trackers. In particular, the proposed FHMMs can accurately describe deformable objects using multiple HMMs, in which each HMM can deal with each part of the deformable objects. Note that a conventional HMM cannot accurately capture deformation of the objects, because it cannot exploit geometric relations between object parts.
The authors quantitatively compared the proposed method with recent deep-learning-based trackers using the OTB-50 data set, as shown in Figure 8. Although the proposed visual trackers do not employ deep neural architectures and only utilise deep features, the proposed FHMM-ISS is comparable with state-of-the-art visual trackers in terms of both precision and success rate. In particular, the proposed visual tracker accurately tracked highly deformable objects by modelling multiple parts using FHMMs and making the FHMMs interact. With high probability, at least one FHMM can accurately track the corresponding part of a target, and its state can be used to re-initialise the states of the other FHMMs. SiamDW and ECO were the best in terms of precision and success plots, respectively. However, these trackers adopted deep neural network architectures, which require high computational costs.

Table 5 quantitatively compares the proposed method with other trackers using the OTB-100 data set. The proposed FHMM-ISS shows the second-best performance in terms of AUC. ECO shows state-of-the-art performance but cannot run in real time. In contrast, the proposed FHMM-ISS is considerably faster than ECO, as shown in Table 4. In addition, the authors compared this method with JTR [16]. As shown in Table 5, the proposed method considerably outperforms JTR, because the proposed method enables full communication among multiple HMMs and each HMM handles changes in each part of an object.

Qualitative results of the visual tracking methods are also shown. In most test videos, severe deformation of the target exists. In addition, the MotorRolling sequence contains objects which are severely rotated over time, while the Ironman sequence includes deformable objects with considerable changes in illumination conditions. The Diving sequence has highly non-rigid objects, where the object size and aspect ratio vary considerably over time. The Shaking sequence contains faces that shake significantly.
The Soccer and Matrix sequences contain diverse obstacles at the same time, including illumination variation, scale variation, occlusion, motion blur, and background clutter. The Skating1 sequence contains non-rigid object deformation and considerable illumination variation. The Bird1 sequence has fast motion, deformation, and out-of-view attributes. As shown in Figure 8, the proposed FHMM-ISS accurately tracked highly deformable objects. Each FHMM robustly tracked a part of the targets, while the proposed interaction strategies prevented each part from drifting into the background. Moreover, each FHMM can handle different obstacles in the scene and propagate information on good states to the other FHMMs. SiamDW and ECO also tracked the targets successfully. However, they failed to represent the target size and aspect ratio precisely.

| CONCLUSION
In this study, a novel visual tracking method was developed based on an interactive FHMM that can utilise the structured information of a target.

TABLE 1: Visual tracking performance of the proposed FHMM-ISS according to the number of parts (i.e. K) in terms of AUC