Part-MOT: A multi-object tracking method with instance part-based embedding

Part-MOT, a one-stage anchor-free architecture that unifies object-identity representation and detection in a single task for visual object tracking, is presented. For object representation, a position-relevant feature is obtained using the center-ness information, which takes advantage of the anchor-free idea to encode the feature map as an instance-aware embedding. To adapt to object movement, a clustering-based method for obtaining a global instance feature is introduced, which makes the approach more robust and enables better tracking decisions. Part-MOT achieves state-of-the-art performance on public datasets, with especially strong results under object deformation and movement changes.


INTRODUCTION
Computer vision has made significant advances in image classification, instance segmentation, and object detection. Tracking, on the other hand, remains challenging, especially when multiple objects are involved. Recent results of tracking evaluations [1][2][3] show that bounding-box-level tracking performance is saturating. Experiments [4] show that further improvements will be possible when using more pixel-level information.
The bounding box has been the dominant form of object annotation. One reason is that it is convenient to annotate with little ambiguity, and it provides sufficient localization information for object detection. Another reason is that almost all image-feature extractors, both traditional [5] and from the deep-learning era [6,7], operate on an input patch with a regular grid form. Moreover, it has been shown that the bounding-box representation facilitates the feature-extraction process [8,9]. For multi-object tracking, however, this annotation suffers from the irregular shapes of different objects: the bounding box contains too much information from other objects, which makes it hard to obtain identity-discriminative features. Also, the datasets that can be used to train and evaluate models at a more precise level of representation, such as the instance-wise level, usually do not provide annotations on video data, or even information on object identities across different images. Taking this into consideration, a dense boundary-point prediction method is used to provide more accurate annotation.

This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited. © 2021 The Authors. IET Image Processing published by John Wiley & Sons Ltd on behalf of The Institution of Engineering and Technology.
With the maturity of multitask learning [10][11][12] in deep learning, more research attention has been paid to one-shot MOT methods. A single network is applied to accomplish object detection and identity embedding simultaneously. Because most of the computation is shared, this network structure largely reduces the inference time. For example, Track-RCNN [13] adds a Re-ID branch to Mask-RCNN [14] while still performing bounding-box regression and classification.
However, due to frequent ID switches, one-shot methods are usually less accurate than two-step methods. The reason is that the network does not learn optimal object features: the embedding features are not aligned with the object centres, which creates much ambiguity. To alleviate this problem, an anchor-free approach is used for object detection and identity embedding. YOLO [15], the popular one-stage anchor-based detector, predicts bounding boxes only at points near the object center, the idea being that points near the center yield more confident detections. Because the number of center points is small, YOLO suffers from low recall. FCOS [16], in contrast, makes use of all points within the ground-truth bounding box and proposes a "center-ness" branch to suppress low-quality detections, achieving recall comparable with anchor-based detectors.
Here, we propose an end-to-end trainable network, which takes detection, instance representation, and tracking as interconnected problems without instance-level annotations. Instead of detecting a rectangular bounding box, our detection localizes the parts of an object instance. More specifically, the aim of our detection is to predict a set of part-boundary points, which are more flexible for describing various shapes of objects. Using part-based boundary points has two advantages: (1) CNN features of irregular object regions can be accurately acquired with part-based boundary points, effectively eliminating the disturbance of background noise to the subsequent object representation and tracking; and (2) with part-based boundary points, an irregular shape can be easily transformed or rectified into a bounding box, which is the practical form of annotation. Therefore, part-based boundary points appear to be a reasonable representation that can smoothly and effectively bridge the representation and tracking modules.
Considering these factors, we model objects by boundary points that adaptively position themselves in the spatial extent, to enable semantically aligned feature extraction. Usually, each object corresponds to multiple detection results, which come from different feature representations; more precisely, the features extracted in the detection branch represent different parts of the object. We propose Part-MOT, a framework that formulates objects as a part-based representation and extracts instance-aware embedding features for tracking association. The proposed approach is shown in Figure 1. First, we use a simple anchor-free object-detection approach to estimate the class confidence, the part-feature center-ness, and the boundary points. Then we adopt a parallel branch for extracting the instance-aware features as the objects' identities by considering the center-ness of the feature parts. In particular, the low-dimensional identity representation reduces computation time as well as improving the robustness of instance association. Furthermore, to deal with differently scaled objects, we use the deep-layer-aggregation operator [17] with the FCOS [16] structure. To improve the robustness of association, we employ a clustering method, which aggregates the object features to refresh the instance representation.
This work makes three contributions: (1) we introduce a new method for object representation using part-based boundary points that adopts FCOS for part-feature extraction; (2) we use the center-ness of object parts to extract instance features for the Re-ID module, which aggregates the part features into a global representation weighted by the center-ness confidence; and (3) to improve the robustness of tracking, the object features are clustered according to the object's movement pattern, which further enriches the representation.

METHODOLOGY
This section presents the whole tracking framework, Part-MOT, which takes the part-based representation and an instance-aware embedding network for multi-object tracking. The network architecture consists of two components. First, the detection branch is designed for object classification, part center-ness prediction, and pseudo-bounding-box regression. The center-ness confidence is utilized to aggregate part features into a global representation, and the pseudo-bounding box is generated by n boundary points. Second, the instance-embedding branch is used for embedding-feature extraction at the instance level to identify different instances. Based on the instance-aware embedding, we introduce a tracking strategy that considers movement: we use not only the traditional IoU but also the embedding similarity to associate objects. Further, we employ a clustering method for object movement to improve tracking robustness.

FIGURE 1
Overview of our one-shot Part-MOT tracker. First, an encoder-decoder network is used to obtain high-resolution feature maps (stride = 4). Then two parallel heads are added to predict the center-ness with pseudo-bounding boxes and the instance-aware embedding features, respectively. The features with different center-ness are combined.

Backbone encoder-decoder network
Per-pixel object detection is used, following the FCOS [16] structure. Specifically, we adopt ResNet-34 [17] as the network backbone, which achieves a good balance between accuracy and speed. Further, we use a variant of DLA [18] to tackle objects at different scales, as shown in Figure 1. Multi-level features are extracted to resolve the ambiguity of overlapped bounding boxes. Compared with the Feature Pyramid Network (FPN) [19], the high-level features in this structure take more information from the low levels, which improves the recall. Formally, the deep aggregation function $\mathrm{Agg}_n$, with depth $n$, is formulated as

$$\mathrm{Agg}_n(x) = N_d\big(R^n_{n-1}(x), \ldots, R^n_1(x), L^n_1(x), L^n_2(x)\big),$$

where $x$ is the input layer, $N_d$ is the aggregation node, $R$ propagates the aggregation of all previous blocks, and $L$ merges aggregation nodes of the same depth. The definition is

$$L^n_2(x) = B\big(L^n_1(x)\big), \quad L^n_1(x) = B\big(R^n_1(x)\big), \quad R^n_m(x) = \begin{cases} \mathrm{Agg}_m(x), & m = n - 1,\\ \mathrm{Agg}_m\big(R^n_{m+1}(x)\big), & \text{otherwise,} \end{cases}$$

where $B$ represents a convolutional block. In addition, bilinear interpolation is used in the upsampling module to alleviate the alignment problem.

Part-based detection
Let $F_i \in \mathbb{R}^{H \times W \times C}$ be the feature map of a CNN layer. Each location $(x, y)$ on $F_i$ is usually taken as an object center. Inspired by semantic segmentation [20], we instead take each location as a sample of the object, so every location of the feature map corresponds to a part of the input image. Specifically, a part within the ground-truth box is considered a positive sample. Thus, the proposed network can leverage as many samples as possible to train every branch. We treat object detection as a dense regression task on a high-resolution feature map. In particular, we estimate the heatmaps, object center-ness, and boundary points by appending three parallel branches to the backbone network. In each branch, we apply a 3 × 3 convolution (with 256 channels) to the output feature maps of the backbone, followed by a 1 × 1 convolutional layer.

Heatmap head
This head estimates the probability that a part belongs to a certain class, and a Gaussian-based ground-truth representation is used to compute the loss. In particular, the dimensions of the heatmap are 1 × H × W. When a location coincides with the ground truth, the heatmap is expected to have a higher response, and this feature is more informative for the classification task. The response decays exponentially with the distance from the object center.
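The Gaussian ground-truth construction can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the kernel width `sigma` and the per-object radius handling here are assumptions.

```python
import numpy as np

def gaussian_heatmap(h, w, centers, sigma=2.0):
    """Ground-truth heatmap: one isotropic Gaussian per object center.

    `centers` is a list of (x, y) positions on the feature map; the
    response is 1.0 at each center and decays exponentially with the
    squared distance from it.
    """
    ys, xs = np.mgrid[0:h, 0:w]
    heat = np.zeros((h, w), dtype=np.float32)
    for cx, cy in centers:
        g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
        heat = np.maximum(heat, g)  # overlapping objects keep the max response
    return heat
```
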

Center-ness head
To reduce the performance gap between FCOS and anchor-based detectors, a single-layer branch is added in parallel with the classification branch (as shown in Figure 1) to predict the center-ness of a feature location. The further a location is from the object center, the lower its score, which largely suppresses the low-quality detections, as shown in Figure 2.
As defined in PolarMask [21], given the set $\{d_1, d_2, \ldots, d_n\}$ of the lengths of the $n$ rays of one instance, where $d_{\max}$ and $d_{\min}$ are the maximum and minimum of the set, the center-ness is defined as

$$\text{center-ness} = \sqrt{\frac{d_{\min}}{d_{\max}}}.$$

As mentioned before, different locations on the feature map correspond to different parts of the object. The center-ness can be used as an effective strategy to combine part-based features: the closer $d_{\max}$ and $d_{\min}$ are, the higher the weight assigned to the part. The sqrt function reduces the decay of the center-ness, and the binary cross-entropy (BCE) loss is used to train the center-ness branch, whose output ranges from 0 to 1. At test time, to filter out low-quality bounding boxes, the classification score is multiplied by the predicted center-ness before non-maximum suppression (NMS) is applied, which remarkably improves detection performance.
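The center-ness of a single location can be computed directly from its ray lengths; a short sketch:

```python
import math

def polar_centerness(rays):
    """Center-ness of a location from its n ray lengths {d_1, ..., d_n}.

    Equals sqrt(d_min / d_max): close to 1 near the object center, where
    the rays are roughly equal, and close to 0 near the boundary, where
    one ray is much shorter than another.
    """
    d_min, d_max = min(rays), max(rays)
    return math.sqrt(d_min / d_max)
```

At test time the detection score would then be `cls_score * polar_centerness(rays)` before NMS.
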

Regression head
We estimate the bounding box of each object with this regression branch. Like [22], four convolutional layers are added on top of the backbone feature maps. Moreover, to ensure the prediction is positive, exp(x) is applied at the top of the regression branch to map any real number to (0, ∞).
As discussed before, the bounding box does not account for shape or semantically important local areas. Thus, for later instance representation, the bounding box captures only the rectangular spatial scope of an object, which is too coarse for object tracking. To obtain finer object localization and better features, we use a set of adaptive points $R$ as the object boundary:

$$R = \{(x_k, y_k)\}_{k=1}^{n},$$

where $n$ is the total number of sample points, set to 9 by default in our experiments. Due to the dense distance regression, the imbalance between regression and classification should be taken into consideration. Intuitively, the $n$ rays together form one instance boundary, so they should be trained as a whole rather than as a set of independent examples. Considering this relevance, the imbalance issue can be largely resolved.

FIGURE 2
To down-weight the low-quality predictions close to the boundary, the center-ness is multiplied with the classification score.
To take advantage of bounding-box annotations in the training process, as well as to train and evaluate the boundary-based object detector, a predefined converting function $T: R_P \rightarrow B_P$ is used to convert the boundary points into a bounding box, where $T(R_P)$ represents a pseudo-box. Three converting functions [23] can be applied for this purpose: the min–max function over all points, the partial min–max function over a subset of points, and the moment-based function.
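The simplest of the three, the min–max function, can be sketched as follows: the pseudo-box is the tightest axis-aligned box over all boundary points, so box-level annotations can supervise the point-based detector.

```python
def minmax_pseudo_box(points):
    """Min-max converting function T: boundary points -> pseudo-box.

    `points` is the list of n boundary points (x, y); the result is the
    tightest axis-aligned box (x1, y1, x2, y2) enclosing all of them.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))
```
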

Part-based representation and instance-aware embedding
To generate features that can distinguish different objects, we propose the instance-aware embedding branch. Ideally, representations of the same object should be more similar than those of different objects. A standard 2D convolution with kernel size $k \times k$ samples features on a fixed regular grid $G = \{(-\lfloor k/2 \rfloor, -\lfloor k/2 \rfloor), \ldots, (\lfloor k/2 \rfloor, \lfloor k/2 \rfloor)\}$, where $\lfloor \cdot \rfloor$ denotes the floor function. The regular grid $G$ cannot guarantee that the sampled features of the object come from the corresponding regions. Therefore, we use deformable convolution [23] to shift the sampling positions from the fixed region to the predicted region, and we extract the part feature with the feature-alignment module, which is formulated as

$$f(u) = \sum_{g \in G} w(g)\, x\big(u + g + \Delta g\big),$$

where $x$ indicates the feature map, $w$ denotes the learned convolution weight, $u$ indicates a location on the feature map, $\Delta g$ is the learned sampling offset, and $f$ represents the output part-feature map. Then, weighted by the center-ness score, we obtain the instance-aware feature, as shown in Figure 4. Because the transformations of the sampling positions are adaptive to the variations of the object, the extracted object-aware feature is robust to changes of object scale, which is beneficial for feature matching during tracking. Moreover, the instance-aware feature provides a global description of the candidate targets, which distinguishes the object from the background more reliably.
This idea is illustrated in Figure 5. Thanks to the deformable convolutions, the features on the boundary can be made more instance-informative, as shown in the middle, and the boundary sample region adapts to the object level, as seen in the yellow samples. Traditional convolution contains more information about the background, as shown on the left, and because of that, traditional detection methods use only the part feature with the best classification score and drop the others. For the tracking task, however, the object features of the parts beside the center play a very important role. So, to take advantage of the deformable convolution, we use all the features of the different parts to distinguish different instances, as shown on the right with the different colour blocks. We combine the different part features with the center-ness score to alleviate the impact of stride, where the final instance feature is obtained by weighted summation.

FIGURE 6
Instance association. Each instance-embedding feature is clustered into a different slot for robustness, and the feature with minimum cosine distance is used for linking. The IoU and embedding distance are adopted for association.
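The center-ness-weighted summation of part features can be sketched as follows. This is a minimal NumPy illustration of the fusion step only; the shapes and normalization are assumptions, and the part embeddings themselves would come from the deformable-convolution branch.

```python
import numpy as np

def aggregate_instance_feature(part_features, centerness):
    """Fuse per-part embeddings into one instance-aware feature.

    part_features: (n_parts, dim) embeddings from the locations of one
    instance; centerness: (n_parts,) confidence of each location. The
    global feature is the center-ness-weighted mean of the parts, so
    near-center parts contribute more than boundary parts.
    """
    w = np.asarray(centerness, dtype=np.float64)
    w = w / w.sum()  # normalize weights so the feature scale is stable
    parts = np.asarray(part_features, dtype=np.float64)
    return (parts * w[:, None]).sum(axis=0)
```
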

Clustering-based adaptive tracking
To link instances, we employ the standard online tracking procedure. Features based on the instance-aware embedding are initialized in the first frame. In subsequent frames, each instance-aware embedding feature is clustered to adapt to the instance variance, where the cosine distance is used to measure similarity, as shown in Figure 6, and $f^{\mathrm{embedding}}$ indicates the final instance embedding of the current frame. To obtain the final tracking result, we perform instance association based on similarities. Given the pseudo-bounding boxes $\mathrm{Box}^p_i$ and $\mathrm{Box}^p_j$, and their embeddings $f^{\mathrm{embedding}}_i$ and $f^{\mathrm{embedding}}_j$, the similarity is formulated as

$$\mathrm{Sim}(i, j) = \lambda\,\mathrm{IOU}\big(\mathrm{Box}^p_i, \mathrm{Box}^p_j\big) + (1 - \lambda)\big(1 - \mathrm{Dist}\big(f^{\mathrm{embedding}}_i, f^{\mathrm{embedding}}_j\big)\big), \tag{3}$$

where Dist denotes the Euclidean distance, IOU represents the pseudo-bounding-box IoU, and $\lambda$ is set to 0.5 by default. If an active track is not updated for the $\beta$ most recent frames, the track is ended automatically. For each instance, the similarity between the latest embeddings of all active tracks and the current detections is computed according to Equation (3). Following [24], a similarity threshold is set for instance association, the Hungarian algorithm [25] is exploited to perform instance matching, and unassigned high-confidence detections start new tracks. By default, $\beta$ and the similarity threshold are set to 24 and 0.3, respectively. Due to the movement of the tracked objects, the extracted features continually change, as shown in Figure 7. Traditional association methods usually find it hard to balance complexity and efficiency; the Kalman filter, for example, has prior parameters to determine, including the process variance and error variance. We use the clustering method to reduce ID switches and improve tracking robustness, where the cosine function is used as the similarity measure. Furthermore, the method for updating the features could be exploited further.
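The association step can be sketched as follows. This is a simplified illustration, not the paper's implementation: it uses a greedy best-pair-first matching as a stand-in for the Hungarian algorithm, and it maps the Euclidean embedding distance into [0, 1] by clamping, which is an assumption about how the distance and IoU terms are combined.

```python
import math

def similarity(box_a, box_b, emb_a, emb_b, lam=0.5):
    """Combined IoU / embedding similarity (lam = 0.5 by default)."""
    # pseudo-box IoU
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    iou = inter / (area_a + area_b - inter + 1e-9)
    # Euclidean embedding distance, clamped so the similarity stays in [0, 1]
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(emb_a, emb_b)))
    return lam * iou + (1.0 - lam) * (1.0 - min(dist, 1.0))

def greedy_associate(tracks, detections, threshold=0.3):
    """Greedy stand-in for the Hungarian assignment: best pairs first.

    tracks / detections are lists of (box, embedding) tuples; returns
    matched (track_idx, det_idx) pairs above the similarity threshold.
    """
    pairs = []
    for i, (tb, te) in enumerate(tracks):
        for j, (db, de) in enumerate(detections):
            pairs.append((similarity(tb, db, te, de), i, j))
    pairs.sort(reverse=True)
    matches, used_t, used_d = [], set(), set()
    for s, i, j in pairs:
        if s >= threshold and i not in used_t and j not in used_d:
            matches.append((i, j))
            used_t.add(i)
            used_d.add(j)
    return matches
```

Unmatched detections would start new tracks, and tracks unmatched for β consecutive frames would be terminated.
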

Loss function
According to the description above, the loss functions are defined as follows.

Heatmap loss

The Gaussian function is used to generate the ground-truth heatmap response at location $(x, y)$, defined as

$$M_{xy} = \sum_{i=1}^{N} \exp\!\left(-\frac{(x - x_i)^2 + (y - y_i)^2}{2c^2}\right),$$

where $N$ is the total number of objects and $c$ is the standard deviation. The pixel-wise logistic regression with focal loss [26] is used as the heatmap loss:

$$L_{\mathrm{heatmap}} = -\frac{1}{N}\sum_{x,y}\begin{cases} \big(1 - \hat{M}_{xy}\big)^{\alpha}\log\big(\hat{M}_{xy}\big), & M_{xy} = 1,\\ \big(1 - M_{xy}\big)^{\beta}\,\hat{M}_{xy}^{\alpha}\log\big(1 - \hat{M}_{xy}\big), & \text{otherwise,} \end{cases}$$

where $\hat{M}$ represents the estimated heatmap and $\alpha$, $\beta$ are hyperparameters.
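A minimal NumPy sketch of this pixel-wise focal loss follows; the default α and β values here are assumptions in the spirit of [26], not the paper's tuned settings.

```python
import numpy as np

def heatmap_focal_loss(pred, gt, alpha=2.0, beta=4.0, eps=1e-6):
    """Pixel-wise focal loss between a predicted and a Gaussian GT heatmap.

    Locations with gt == 1 are positives; every other location is a
    negative whose penalty is down-weighted by (1 - gt)^beta, so pixels
    near an object center are punished softly. Normalized by the number
    of object centers.
    """
    pred = np.clip(pred, eps, 1.0 - eps)  # keep the logs finite
    pos = gt >= 1.0
    n = max(int(pos.sum()), 1)
    pos_loss = ((1.0 - pred) ** alpha * np.log(pred))[pos].sum()
    neg_loss = ((1.0 - gt) ** beta * pred ** alpha * np.log(1.0 - pred))[~pos].sum()
    return -(pos_loss + neg_loss) / n
```
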

Center-ness loss and regression loss

The center-ness loss and regression loss are defined as follows:

$$L_{\mathrm{det}} = \frac{1}{N_{\mathrm{pos}}}\sum_{x,y} L_{\mathrm{cls}}\big(p_{x,y}, c^*_{x,y}\big) + \frac{\lambda}{N_{\mathrm{pos}}}\sum_{x,y} \mathbb{1}_{\{c^*_{x,y} > 0\}}\, L_{\mathrm{reg}}\big(b_{x,y}, b^*_{x,y}\big),$$

where $L_{\mathrm{cls}}$ indicates the focal loss as in [26] and $L_{\mathrm{reg}}$ represents the pseudo-bounding-box IoU loss as in UnitBox [25]; $c^*_{x,y}$ indicates the class label of the location $(x, y)$, being positive within the ground-truth box, and $b_{x,y}$ is the bounding-box coordinate. $N_{\mathrm{pos}}$ is the number of positive samples, and $\lambda$ is used to balance the classification loss and regression loss. $\mathbb{1}_{\{c^*_{x,y} > 0\}}$ is the indicator function, being 1 if $c^*_{x,y} > 0$ and 0 otherwise. The total loss is calculated by summation over the whole feature map.

Instance-embedding loss
Object-identity embedding is formulated as a classification problem: all object instances with the same identity in the training set are treated as one class [27]. For each ground-truth box $b^i = (x^i_1, y^i_1, x^i_2, y^i_2)$ in the image, we extract the embedding feature vector and obtain the class distribution $p(k)$. Denoting the ground-truth class label as $L^i(k)$ with one-hot encoding, the embedding loss is computed as

$$L_{\mathrm{identity}} = -\sum_{i=1}^{N}\sum_{k=1}^{K} L^{i}(k)\log\big(p(k)\big),$$

where $K$ is the number of classes.
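Because the labels are one-hot, the inner sum reduces to the log-probability of the true identity; a short sketch:

```python
import math

def identity_embedding_loss(class_probs, labels):
    """Cross-entropy identity loss over K identity classes.

    class_probs: one probability distribution p(k) per instance;
    labels: the ground-truth identity index of each instance. With
    one-hot targets, each instance contributes -log p(true identity).
    """
    loss = 0.0
    for p, y in zip(class_probs, labels):
        loss -= math.log(max(p[y], 1e-12))  # clamp to keep the log finite
    return loss / max(len(labels), 1)
```
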

Datasets and evaluation metrics
We composed a large training dataset by combining the training data from the MOT16 and MOT17 [1] and CUHK-SYSU [30] datasets. They contain sequences with varying viewing angles, sizes and numbers of objects, camera motion, and frame rates. Average precision (AP) [30] was used to evaluate detection performance, and the true-positive rate (TPR) was applied to evaluate the embedding representation. Tracking accuracy was evaluated with the multiple-object tracking accuracy (MOTA) [30] and the ID F1 score (IDF1) [31], as they quantify two of the main aspects of multiple-object tracking, namely object coverage and identity preservation.
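For reference, MOTA combines the three main error types into a single score; a one-line sketch of the standard definition:

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """MOTA = 1 - (FN + FP + IDSW) / GT, accumulated over all frames.

    num_gt is the total number of ground-truth objects across the
    sequence; a perfect tracker scores 1.0, and the score can go
    negative when errors outnumber the ground-truth objects.
    """
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt
```
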

Training details
A variant of DLA-34 [18] is used as our default backbone (see Section 2.1), and the model weights are initialized on the COCO dataset [34]. The instance-aware embedding branch is pretrained on the re-identification (Re-ID) task on two public datasets: Market1501 [35] and CUHK03 [36]. On the MOT dataset, we perform data augmentation by rotation, scaling, and cropping, and jointly finetune the detection and instance-aware embedding branches by adding the losses (Equations 8-10) together. The batch size is set to 16. The input image is resized to 1088 × 608, and the feature-map resolution is 272 × 152. We train for 22 epochs with a learning rate of 1e−4, a weight-decay term of 1e−5, and the Adam optimizer with β1 and β2 set to 0.9 and 0.999, respectively.

Part-based instance embedding and center-based embedding
This section evaluates the impact of the proposed part-based instance embedding. In particular, we compare it with the center-based embedding method, where the embedding feature is extracted using only the center features and the feature map is referenced at the object-center position [32]. All remaining factors stay the same, and the 2DMOT15 training set is split into eight training videos and three validation videos. A 512-dimensional feature has been used in previous works; to evaluate the importance of the feature dimension, we compare different dimensions. The results are shown in Table 1.
One can see that the overall MOTA score is improved by the proposed part-based instance-aware embedding method. Analysis of the detailed metrics makes the importance of fewer ID switches clear, and the detection metric measured by TPR improves from 61.5 to 69.2. Furthermore, as the representation dimension gradually decreases from 512 to 128, the TPR improves, the number of ID switches decreases from 182 to 127, and the inference speed also improves. By comparison, as the feature dimension of the part-based method increases, the MOTA score declines. This is because more information from the boundary parts contains features of other instances, which makes the fused parts difficult for the model to identify. So, a properly low-dimensional representation suffices for the tracking task. Figure 8 visualizes the embedding features of the different methods: the instance features of the center-based approach are mixed together, especially at high dimensions, while the proposed part-based approach distinguishes them well. We think the backbone also plays an important role, as it enables more flexible receptive fields for objects of different sizes, which is very important for an anchor-free method. Since an anchor-free method has no anchors to handle scale variance, convolutions at different scales can substitute for them, as confirmed in Table 2.

Cluster-based association versus similarity matching
We evaluate the different association mechanisms in Table 3. The basic tracking structure follows the explanation above, but with different association heads. For the proposed method, the association matches detections up to β frames in the past, while the other mechanisms need no initialization and associate only adjacent frames. The table reveals that the proposed cluster-based association method achieves a notably better MOTA score, due to the representation clustering. In particular, the number of ID switches decreases from 182 to 127, which indicates that the applied method largely improves robustness. The traditional bounding-box IoU association mechanism is also evaluated: the IoU mechanism alone largely decreases performance, because it does not take the instance feature into consideration alongside detection, although it does largely improve the association speed.

Qualitative results
In this section, we compare the proposed Part-MOT with state-of-the-art one-shot methods, including JDE [29] and FairMOT [33], which also combine object detection and identity-feature embedding, as well as other SOTA methods such as SORT [24], DeepSORT [38], DeepMOT [39], MPNTrack [40], CSTrack [41], CTracker [42], and CenterTrack [43]. Specifically, we use the 2DMOT15-train and MOT16-test datasets to validate our model. The performance results are shown in Table 4 with the MOTA metric [30] and IDF1 [31], showing that the proposed approach outperforms the other methods. This validates the effectiveness of the one-shot part-based approach over the traditional center-based one.
The MOTA-versus-speed curve on 2DMOT15 is shown in Figure 9. One can see that the proposed Part-MOT method ranks first among all the trackers, with a higher MOTA score and faster inference speed, thanks to its simple structure. Figure 10 shows successful tracking cases on MOT2015 and MOT2016. Owing to the robustness of the part-based instance-aware embedding, our tracker can detect and track targets under large scale deformation. The combination of the backbone and the instance-aware embedding branch can also adapt to varying object shapes. Failure cases are shown in Figure 11: when the target is severely occluded by a similar object, the tracking box drifts. The same failure can occur when the ground truth contains only a small part of an object, because our tracker takes more semantic information about objectness and tracks the whole instance.

CONCLUSION
Here, we proposed a one-shot, anchor-free, multiple-object tracking method that consists of two parallel branches. The detection branch classifies instances, regresses the dense points of the sampled locations around contours, and estimates center-ness to weight the part features into a global instance representation. The instance-aware embedding branch extracts the part-based global instance features using a combination of deformable convolution and center-ness scores. To improve linking robustness, we first cluster the features of each instance, then use the IoU and a feature-similarity measure to do the association. Part-MOT is designed as a simple one-shot object tracker, and the experiments show its effectiveness, accuracy, and robustness. In future work, we will study applying our framework to other video tasks, for example video object detection and segmentation, and at the same time explore other instance-representation learning methods, such as contrastive learning [44], within this framework.