Segment attention-guided part-aligned network for person re-identification

Part misalignment of the human body, caused by complex variations in viewpoint and pose, poses a fundamental challenge to person re-identification. This letter adopts Res2Net as the backbone network to extract multi-scale appearance features, and uses a human parsing model to extract part features, which serve as an attention stream guiding the re-calibration of part features in the spatial dimension. Additionally, to ensure feature diversity, SAG-PAN effectively integrates the global appearance features of the person image with fine-grained part features. Experimental results on the Market-1501, DukeMTMC-reID and CUHK03 datasets show that the proposed SAG-PAN achieves superior performance against existing state-of-the-art methods.

Introduction: Person re-identification is a cross-camera retrieval task that has emerged from increasing public-safety requirements and large-scale camera networks in public areas. However, the task remains challenging because of the misalignment of human body parts caused by cross-view viewpoint changes, pose variation, illumination, and occlusion.
Most methods for person re-identification assume that features from different parts of an image are equally important, with the entire image sharing one filter bank. In reality, however, different parts contribute differently to re-identification. Recent research has mainly focused on using attention mechanisms to re-weight features. Li et al. [2] proposed the harmonious attention network (HAN), which combines soft pixel attention and hard regional attention through a harmonious attention module. Guo et al. [3] proposed P²-Net, whose kernel module, the dual part-aligned block (DPB), uses a human parsing model to generate body-part masks. These methods focus only on local features and ignore global features. To address this problem, this letter introduces the segment attention-guided part-aligned network (SAG-PAN), in which a human parsing model is introduced as attention to extract local features while preserving global features.
Segment attention-guided part-aligned network: Our proposed network, SAG-PAN, enables the network to selectively extract more accurate deep features in the spatial dimension through a human part parsing model that introduces part attention. The specific structure is shown in Figure 1 and comprises a global branch and a human part branch. The human part branch extracts part features and converts them into a human body part mask to capture local features; the global branch extracts global features.
Human part branch: The human part branch is a dual-stream network. Res2Net-50 is used as the backbone of the appearance feature extractor, while the human part parsing model extracts the part features. The part features are then converted into a part mask that weights the appearance features to obtain local features.
The human parsing model adopts the parsing branch of augmented context embedding with edge perceiving (A-CE2P) [4] to capture high-level semantic information. It consists of a backbone network and a context coding module, the latter being its kernel module. The backbone comprises three convolutional layers, a pooling layer and four groups of ResNet-101 sub-modules. The context coding module uses global context information to identify fine-grained category information and extract high-level features of different parts. The model segments the human image into seven parts: background, head, torso, upper arm, lower arm, upper leg, and lower leg.
The binary mask f_bp ∈ R^{H×W} can be obtained by calculating the following equation:

f_bp(h, w) = 1, if argmax_i f_p^i(h, w) is not the background channel; 0, otherwise,

where f_p^i represents the feature map of the ith channel of f_p, f_a ∈ R^{H×W×C_A} represents the appearance feature, f_p ∈ R^{H×W×C_P} represents the part feature, H and W represent the height and width of the feature map, and C_A and C_P represent the numbers of appearance feature channels and part feature channels, respectively.
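As an illustrative sketch, assuming the binary mask comes from the channel-wise argmax of the part prediction with channel 0 as background (an assumption, since the letter does not fix the channel ordering), the computation could look like:

```python
import numpy as np

def binary_part_mask(f_p: np.ndarray) -> np.ndarray:
    """Binarise a part feature map of shape (H, W, C_P) into an (H, W)
    mask: 1 where the most likely part is not background.
    Channel 0 is assumed (hypothetically) to be the background class."""
    return (np.argmax(f_p, axis=-1) != 0).astype(np.float32)

# Toy example: a 2x2 map with 3 part channels (channel 0 = background).
f_p = np.array([[[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]],
                [[0.2, 0.1, 0.7], [0.6, 0.3, 0.1]]])
mask = binary_part_mask(f_p)
```

Here the top-left and bottom-right pixels are predicted as background, so the mask zeroes them out while keeping the two body-part pixels.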
Element-wise multiplication is used to integrate the appearance feature with the binary mask of the part feature, and a global average pooling (GAP) layer is then applied to obtain the feature f:

f = GAP(f_a ⊙ f_bp),

where broadcasting replicates f_bp along the channel dimension so that it has the same number of channels as f_a. The binary mask thus re-calibrates the appearance feature in the spatial dimension. The human part feature map reflects the feature values of the various body parts, and the normalised part feature values can be regarded as attention that guides the appearance feature extractor to learn part features efficiently while weakening the influence of background information. Lastly, a BN layer is applied to obtain the part feature f_local:

f_local = BN(f).

During training, f and f_local are used to calculate the triplet loss and identification loss, respectively. During testing, f_global is used as the descriptor of a sample's features, and cosine distance is used to measure the similarity between samples.
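The masking, pooling and normalisation steps described above can be sketched in NumPy (a minimal illustration; the batch-norm helper omits the learned scale and shift, and the feature sizes are made up):

```python
import numpy as np

def part_pooled_feature(f_a: np.ndarray, f_bp: np.ndarray) -> np.ndarray:
    """Broadcast the (H, W) binary mask across the C_A channels of the
    (H, W, C_A) appearance feature, multiply element-wise, then apply
    global average pooling to get a (C_A,) feature vector."""
    masked = f_a * f_bp[..., None]      # mask replicated over channels
    return masked.mean(axis=(0, 1))     # global average pooling

def batch_norm(f: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Batch normalisation over a (B, C) batch of feature vectors,
    without the learned scale/shift parameters."""
    return (f - f.mean(axis=0)) / np.sqrt(f.var(axis=0) + eps)

rng = np.random.default_rng(0)
f_a = rng.random((4, 4, 8))                       # appearance feature
f_bp = (rng.random((4, 4)) > 0.5).astype(float)   # binary part mask
f = part_pooled_feature(f_a, f_bp)                # pooled feature f
f_local = batch_norm(np.stack([f, f + 1.0]))      # BN over a toy batch
```

Broadcasting via `f_bp[..., None]` is exactly the self-replication across channels that the text describes.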
Our loss function combines the identification loss, centre loss and triplet loss objectives, and can be expressed as

L = L_part + λ L_global,

where L_part and L_global each sum the identification, triplet and centre losses of the part branch and the global branch, respectively, and λ represents the trade-off factor of the global branch loss function.
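A sketch of how the three objectives might be combined, assuming the part-branch and global-branch losses are each the plain sum of the three terms with λ weighting the global branch (the exact weighting scheme is not spelled out in the text):

```python
def total_loss(part_losses, global_losses, lam=1.0):
    """Sum the (id, triplet, centre) losses of the part branch and add
    the global-branch sum weighted by the trade-off factor lam.
    This combination is an assumed form, for illustration only."""
    return sum(part_losses) + lam * sum(global_losses)

# e.g. part-branch losses (1.0, 1.0, 1.0), global-branch (1.0, 1.0, 1.0)
loss = total_loss((1.0, 1.0, 1.0), (1.0, 1.0, 1.0), lam=0.5)
```

With λ = 0.5 the example gives 3 + 0.5 × 3 = 4.5.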
L_ID is an N-class cross-entropy loss:

L_ID = -(1/|B|) Σ_{j=1}^{|B|} Σ_{i=1}^{N} q_i log p_i,

where |B| represents the batch size, N represents the number of classes, p_i represents the predicted probability that a sample belongs to the ith class, and q_i is the corresponding ground-truth indicator. L_triplet denotes the triplet loss:

L_triplet = max(dist(f_a, f_p) - dist(f_a, f_n) + m, 0),

where dist(f_a, f_p) and dist(f_a, f_n) respectively represent the distance of the positive sample pair and the negative sample pair (with f_a, f_p and f_n here denoting the anchor, positive and negative features), and m is the margin. L_centre denotes the centre loss:

L_centre = (1/2) Σ_{i=1}^{|B|} ||x_i - c_{y_i}||²_2,

where y_i represents the class of the ith sample, c_{y_i} represents the class centre of the ith sample, and x_i represents the feature of the ith sample.
Experimental results: SAG-PAN achieves 79.9% Top-1 accuracy and 76.3% mAP. On the Market-1501 dataset, the Top-1 accuracy of SAG-PAN is 95.7% and the mAP score is 89.8%; its Top-1 value is close to that of other state-of-the-art algorithms, while its mAP value is 1.5% higher than the closest algorithm. In addition, BAT-Net [7] is also a dual-branch network that uses an attention mechanism; on Market-1501, SAG-PAN is 0.6% and 2.4% higher than BAT-Net in Top-1 and mAP, respectively. CCAN also uses global and local branches to extract features, and SAG-PAN's mAP and Top-1 are higher than those of CCAN on all of the Market-1501, DukeMTMC-reID and CUHK03 datasets. Figure 2 shows example query results on Market-1501. It can be seen that SAG-PAN alleviates, to a certain extent, the problem of body-part misalignment caused by complex variations in viewpoint and pose.
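Under the standard forms of the three loss terms defined earlier (the margin value, batching and distance inputs are assumptions for illustration), minimal NumPy sketches could be:

```python
import numpy as np

def id_loss(p: np.ndarray, y: np.ndarray) -> float:
    """N-class cross-entropy: p is (|B|, N) row-wise probabilities,
    y holds each sample's true class index."""
    return float(-np.mean(np.log(p[np.arange(len(y)), y])))

def triplet_loss(d_ap: np.ndarray, d_an: np.ndarray, m: float = 0.3) -> float:
    """Hinge on positive-pair vs negative-pair distances with margin m
    (m = 0.3 is an assumed default), averaged over the batch."""
    return float(np.maximum(d_ap - d_an + m, 0.0).mean())

def centre_loss(x: np.ndarray, centres: np.ndarray, y: np.ndarray) -> float:
    """Half the summed squared distance of each feature x_i from its
    class centre c_{y_i}."""
    return float(0.5 * np.sum((x - centres[y]) ** 2))
```

For example, with probabilities [[0.8, 0.2], [0.3, 0.7]] and labels [0, 1], the cross-entropy averages -log 0.8 and -log 0.7; a triplet with d_ap = 1.0, d_an = 0.5 and m = 0.3 incurs a hinge loss of 0.8.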