Part-level attention networks for cross-domain person re-identification

Person re-identification (Re-ID) is in significant demand for intelligent security and single- or multiple-target tracking. However, person Re-ID tasks suffer from issues such as a sharp decline in detection accuracy across data sets and the poor generalization and cross-domain ability of models. This work mainly studies the generalization and adaptation of cross-domain person Re-ID models. Unlike most existing methods for cross-domain Re-ID tasks, the authors use diversified pixel-level spatial semantic features to improve the generality and adaptability of the model. No information from the target domain is used during model training; the trained model is tested directly on the target-domain data set. Adding the attention cascade module to the backbone network, combined with the part-level branch, has proven effective. The authors conducted extensive experiments on three data sets, Market-1501, DukeMTMC-ReID and MSMT17: in both single-domain and cross-domain tests, the proposed method, named Part-Level Attention Network, improves Rank1 and mAP values by about 10% on average compared with the baseline.


INTRODUCTION
Person Re-ID is an extension of the retrieval task, which aims to search for the most likely matching images of the same pedestrian captured at different locations and times. Person Re-ID is commonly used for security monitoring, such as cross-border target tracking and intelligent security. In 2006, Gheissari et al. [1] were the first to propose using colour and salient edge histograms to recognize pedestrians. With the rapid expansion of general deep learning research, many scholars [2][3][4][5][6][7] began to explore person Re-ID methods based on deep learning. Re-ID remains an open problem in view of the random and significant changes in person posture, lighting conditions, occlusion and background noise. Some existing methods transform source-domain images into the style of the target domain and then use the transformed images to train the model. However, they are required to operate on both the source domain and the target domain, which increases task complexity. The challenge we tackle is therefore to use no feature information of the target domain at all: the pre-trained model should still achieve excellent performance in a new target domain.
The most challenging problem is how to optimize the adaptation and generalization ability of the model; in other words, a pre-trained model should still be able to extract discriminative features from other data sets. To address this issue, we propose the following improvements to the model of Luo et al. [15], which we use as our baseline: 1. We introduce an attention cascade module by cascading an optimized SE (Squeeze-and-Excitation) module and a PAM (Position Attention Module). The attention cascade module enhances the attention on spatial features of persons in the foreground and reduces the influence of background noise on model features. 2. We design a new part-level branch to divide the features of the backbone network and extract diverse features with different discriminative channel characteristics. 3. By combining the features extracted from the backbone network with the part-level branch features, the Re-ID model is trained to be more generalizable and adaptable.
In the following sections, we first review models and solutions for cross-domain learning, attention and feature learning in person Re-ID tasks (Section 2). We then present the overall framework of the PLAN (Part-Level Attention Network) method, including the attention cascade module and the part-level branch design (Section 3). Subsequently, we perform extensive experiments on three Re-ID data sets, namely Market-1501, DukeMTMC-ReID and MSMT17, and discuss the role of each module in the PLAN method through ablation experiments (Section 4). We conclude with the problems solved by the PLAN method, which produces the best performance in the cross-domain setup (Section 5).

RELATED WORK
The task of person Re-ID has cross-domain characteristics in practical applications, which requires a more generalizable and adaptable model. In the past decade, researchers focused on integrating target-domain features into model training [16][17][18], such as camera angles, light intensity, person attributes and image style differences, with GAN or style-transfer methods. Liu et al. [10] propose an adaptive transfer network (ATNet) for person Re-ID. Following the principle of 'divide and conquer', ATNet identifies the root causes of the domain gap and removes them: it decomposes the complex cross-domain transfer into a group of factor-wise sub-transfers, each focusing on the style transfer of one imaging factor. Selecting lighting, resolution and camera angle as the main factors of difference between domains, ATNet applies an adaptive strategy that integrates the factor transfers by measuring the influence of these factors on the image, and generates a set of images similar in style to the target domain. Yu et al. [19] propose an unsupervised soft multi-label learning model for person Re-ID, mining the potential label information of unlabelled samples by introducing an auxiliary data set. The model compares an unlabelled image with the labelled source images and learns a soft multi-label (a vector of likelihoods of the real label) for each unlabelled person. Many other unsupervised methods [20][21][22][23][24][25] train on the target-domain data set with clustering and pseudo-labelling, thus operating on target-domain images to some extent during training or image processing. All of the above methods extract features by fitting the model to processed image data, but they do not extract more discriminative features from the model itself. The method we propose here faces the cross-domain challenge directly.
The target domain does not participate in the training process of the model at all; it is used only to test the generalizability and adaptability of the model. In the field of deep learning and computer vision, the SE model exploits the relationship between channels [26]: channel attention is calculated from the global average of the aggregated features. The convolutional block attention module (CBAM) [27] introduces spatial attention to mine both channel and spatial attention: it first aggregates the channel information of the feature map using maximum pooling and average pooling, then concatenates and convolves these aggregated features to generate a two-dimensional attention map. Such attention models have proven to be an effective way to improve the performance of deep neural networks. Inspired by these works, many scholars [28][29][30][31][32][33][34][35] now use attention mechanisms in person Re-ID tasks. Chen et al. [31] propose two different attention branches that enable the learned feature map to perceive persons and related body parts, respectively. Wang et al. [32] propose a multi-receptive-field attention (MRFA) module, which uses filters of different sizes to help the network focus on informative pixels, and introduce a Gaussian-level random clipping/filling method to further improve the robustness of the network. Although these methods apply various attention mechanisms to person Re-ID tasks, most of them focus only on single-domain performance and do not test cross-domain performance. We use a PAM (Position Attention Module) and an optimized SE model to form an attention cascade module added to the backbone network, which is derived from an improvement on the self-attention module [33].
Feature learning is one of the key techniques for pedestrian recognition. In the past few years, various learning methods based on the extraction of local, global and multi-branch features have been proposed, and how to obtain discriminative and diverse characteristics has become a hot topic. Zheng et al. [7] propose a Part-based Convolutional Baseline (PCB) network, which divides the convolution layer uniformly to learn partial features. A PCB network divides the person feature space into six parts, and the feature vector of each part is used to generate a single ID prediction loss. In recent years, the PCB method has become very popular; it is widely used in person Re-ID tasks, and more effective methods have been developed based on it [36,37]. In order to obtain diverse characteristics from end-to-end training, multi-branch network architectures have been widely used [4,7,38], usually following a shared backbone network with multiple sub-network branches; different module mechanisms, such as attention, can then be applied in the different branches. We combine part-level features and a multi-branch network to design a new model that joins local and global features, extracts diverse and more discriminative features, and improves the generalization ability of the model. The attention cascade module and part-level branch are thus introduced into the backbone network to form our Part-Level Attention Network (PLAN).

PART-LEVEL ATTENTION NETWORK
The backbone of PLAN is ResNet50, which is mainly composed of four residual convolution blocks, namely stages 1 to 4, as shown in Figure 1. An attention cascade module is added after stages 2 and 3, respectively, to extract pixel-level spatial semantic features in the backbone network. The output of stage 4 splits into two branches. One is the backbone-network branch, designed for training the backbone feature classification; an optional Plus branch applies max and average pooling, respectively, to the global features of the backbone network. The other is the part-level branch, in which the input features are divided equally into four parts. The feature vectors of the max pool and the average pool are added and then split into column feature vectors for model training. Finally, the output features of the two branches are concatenated and normalized for testing.

Attention
The position attention module (PAM) is inspired by self-attention [33]. It is used to capture and aggregate semantically related pixels in the spatial domain. As shown in Figure 2, for each feature map we calculate the degree of correlation between each pixel and the appearance of every other pixel, obtaining the red feature map in the figure. Because a single-pixel feature is too small, we also use yellow 2 × 2 and blue 3 × 3 features. The dot product of the three feature maps yields the appearance-relation feature map. The ith row of the resulting matrix S then describes the relationship between the ith point on the feature map and all other points j.
If there are C channel feature maps, the result is C grey feature maps. The input feature X ∈ R^(C×H×W) first generates the feature maps F(x) ∈ R^(N×C), G(x) ∈ R^(C×N) and H(x) ∈ R^(C×N) through 1 × 1 convolution kernels, where C is the number of channels, H × W is the feature map size and N = H × W. Then, F(x) and G(x) are used to compute the pixel attention matrix S ∈ R^(N×N), which captures the correlation between the N pixels:

S_ij = exp(F(x)_i · G(x)_j) / Σ_{k=1}^{N} exp(F(x)_k · G(x)_j).   (1)

The pixel-level attention matrix S is then dotted with the mapping feature H(x), and the PAM feature m_j is obtained by normalizing with a BN layer:

m_j = BN( Σ_{i=1}^{N} S_ij · H(x)_i ).
Here, S_ij represents the degree of attention that the model pays to position i when synthesizing region j. The PAM features are transmitted to the output of the layer through

P_j = λ · m_j + X_j,   (2)

where λ is a learnable weight.
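As a concrete illustration, the PAM computation described above can be sketched in PyTorch. This is a minimal sketch under our own naming and design assumptions (for example, keeping the full channel count C in all three 1 × 1 convolutions), not the authors' released code:

```python
import torch
import torch.nn as nn

class PAM(nn.Module):
    """Position attention module sketch: self-attention over spatial positions."""

    def __init__(self, channels):
        super().__init__()
        self.f = nn.Conv2d(channels, channels, 1)  # F(x), reshaped to (B, N, C)
        self.g = nn.Conv2d(channels, channels, 1)  # G(x), reshaped to (B, C, N)
        self.h = nn.Conv2d(channels, channels, 1)  # H(x), reshaped to (B, C, N)
        self.alpha = nn.Parameter(torch.zeros(1))  # learnable weight, initialized to 0

    def forward(self, x):
        b, c, hh, ww = x.shape
        n = hh * ww
        f = self.f(x).view(b, c, n).permute(0, 2, 1)   # (B, N, C)
        g = self.g(x).view(b, c, n)                    # (B, C, N)
        s = torch.softmax(torch.bmm(f, g), dim=-1)     # (B, N, N) pixel attention matrix
        h = self.h(x).view(b, c, n)                    # (B, C, N)
        m = torch.bmm(h, s.permute(0, 2, 1)).view(b, c, hh, ww)  # aggregate related pixels
        return self.alpha * m + x                      # residual: P = alpha * m + X
```

Because the learnable weight starts at zero, the module initially behaves as an identity mapping and gradually learns how much attention to inject during training.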

FIGURE 2 The attention cascade module
The learnable parameter λ is initialized to 0 to control the weight of the output of the attention layer. The output features of the attention layer are then added to the input features to obtain the feature P, which is fed into the cascaded optimized SE model.

Optimization of the SE model: Different channels of the feature map contribute differently to identifying objects. The SE module in SENet [26] first carries out a Squeeze operation on the feature map obtained by convolution to produce channel-level global features, then applies an Excitation operation on the global features to learn the relationship between the channels, and finally multiplies the result with the original feature map to obtain the final features. In this way, the model can pay more attention to channel features and expand the feature receptive field. However, because the global pooling operation of the SE module loses spatial information, it cannot extract features comprehensively and accurately in space. We therefore use a 1 × 1 convolution kernel to replace the global pooling layer and the fully connected layer in the SE module, retaining the spatial information of the features. The feature map obtains an attention value between 0 and 1 through the Sigmoid activation function:

A = Sigmoid(Conv_1×1(P)),

and the attention feature is multiplied element-wise with the original feature map:

Y = A ⊙ P.

As shown in Figure 2, through the cascade of the two attention mechanisms, the PAM module obtains pixel-level spatial characteristics; after the optimized SE model, they are input into the backbone network to expand the feature receptive field and obtain more comprehensive channel feature information. By extracting more accurate and comprehensive pixel-level spatial location features, more discriminative pixel-level spatial semantic features of persons can be obtained.
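A minimal sketch of the optimized SE module follows. The two-layer 1 × 1 convolution bottleneck and the reduction ratio of 16 are our assumptions; the paper only specifies that 1 × 1 convolutions replace the global pooling and fully connected layers, and that a Sigmoid produces attention values in (0, 1):

```python
import torch
import torch.nn as nn

class OptimizedSE(nn.Module):
    """SE variant sketch: 1x1 convolutions instead of global pooling + FC,
    so spatial information is preserved in the attention map."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),  # squeeze via 1x1 conv
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),  # excitation via 1x1 conv
            nn.Sigmoid(),                                   # attention values in (0, 1)
        )

    def forward(self, x):
        return x * self.gate(x)  # multiply attention element-wise with input features
```

Unlike the original SE block, the gate here keeps an H × W spatial map per channel, so each position can be re-weighted individually.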

Attention part-level branch
A PCB network [7] divides the whole feature space into m horizontal stripes and generates m part-level feature vectors. As shown in Figure 3, the input image is passed forward through the convolution layers of the backbone network to form a three-dimensional tensor T. PCB uses only a traditional average pooling layer to sample the T-space into m column vectors g_1, g_2, …, g_m, and then m classifiers generate m ID prediction losses; a fully connected (FC) layer and the softmax function are used for classification. Given a batch of labelled input samples (x_i, y_i), i = 1, …, n, PCB uses multiple ID prediction losses as follows:

L = − Σ_{i=1}^{n} Σ_{p=1}^{m} log( exp((W^p_{y_i})ᵀ g^p_i) / Σ_j exp((W^p_j)ᵀ g^p_i) ),

where W^p_j and W^p_{y_i} are the jth and y_i-th columns of the weight matrix W^p, respectively, and W^p is the classifier specified for g^p. As illustrated in Figure 3, by forcing each part-level feature vector to match a single ID prediction loss, useful part-level features can be obtained to discriminate different persons. However, the multiple part-level feature vectors may be unable to capture the discriminative information between persons. This limits the part-level information the PCB method can obtain.
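The PCB training objective described above, m separate ID losses with one classifier per part vector, can be sketched as follows. This is a hypothetical minimal version; `classifiers` stands in for PCB's m part-specific FC classifiers:

```python
import torch
import torch.nn as nn

def pcb_loss(part_vectors, labels, classifiers):
    """Sum of m cross-entropy ID losses, one per part vector g_p (PCB-style sketch).

    part_vectors: list of m tensors of shape (batch, channels)
    labels:       tensor of shape (batch,) with identity indices
    classifiers:  list of m nn.Linear classifiers, one per part
    """
    ce = nn.CrossEntropyLoss()
    return sum(ce(clf(g), labels) for g, clf in zip(part_vectors, classifiers))
```

Each part vector is graded independently, which is exactly the property the next paragraph argues against: no single classifier ever sees the whole-body feature.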
In order to learn identification features at the part level, we suggest concatenating the m part-level feature vectors into a column vector d to calculate a single ID prediction loss:

L = − Σ_{i=1}^{n} log( exp(W_{y_i}ᵀ d_i) / Σ_j exp(W_jᵀ d_i) ),

where W_j and W_{y_i} are the jth and y_i-th columns of the weight matrix W, respectively, and W is a single classifier for the vector d.
The vector d contains all the information of the input image, and sufficient discriminative information can be learned using a single ID loss. The proposed method therefore performs max pooling (MP) and average pooling (AP) of the tensor T to capture the statistical characteristics of the different channels in the feature map:

g_p = MP(T)_p + AP(T)_p,   d = [g_1; g_2; …; g_m].
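The part-level branch above can be sketched in PyTorch. The pooling and the single classifier follow the description; the module and variable names, and the default of 751 identities (the Market-1501 training-ID count), are our own illustrative choices:

```python
import torch
import torch.nn as nn

class PartLevelBranch(nn.Module):
    """Sketch of the part-level branch: T is max- and average-pooled into m
    part vectors, the two poolings are summed, and the parts are concatenated
    into a single vector d scored by one ID classifier."""

    def __init__(self, channels, num_parts=4, num_ids=751):
        super().__init__()
        self.mp = nn.AdaptiveMaxPool2d((num_parts, 1))   # MP(T): (B, C, m, 1)
        self.ap = nn.AdaptiveAvgPool2d((num_parts, 1))   # AP(T): (B, C, m, 1)
        self.classifier = nn.Linear(channels * num_parts, num_ids)  # single W

    def forward(self, t):                 # t: stage-4 feature map (B, C, H, W)
        g = self.mp(t) + self.ap(t)       # g_p = MP(T)_p + AP(T)_p
        d = g.flatten(1)                  # concatenate the m parts into d
        return d, self.classifier(d)      # d plus logits for one ID prediction loss
```

A single cross-entropy loss over the logits then trains all parts jointly, in contrast to PCB's m independent losses.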

Implementation details
Based on the PyTorch framework, our method uses the ResNet50 model [15] pre-trained on ImageNet as the backbone network. In stage 4, we change the convolution stride from 2 to 1. In the training stage, input images are resized to 256 × 128 and, for data augmentation, padded with 10 pixels and randomly cropped. We randomly sample P identities and K images per identity to form a training batch, so the batch size required by the triplet loss is P × K.
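The P × K batch construction can be sketched as follows. This is a simplified, hypothetical sampler (real implementations usually wrap this logic in a PyTorch `Sampler`), shown only to make the batch structure concrete:

```python
import random
from collections import defaultdict

def pk_batch(labels, p=16, k=4):
    """Draw P identities and K images per identity: a batch of P*K samples.

    labels: list where labels[i] is the identity of image i.
    Samples with replacement when an identity has fewer than K images.
    """
    by_id = defaultdict(list)
    for idx, pid in enumerate(labels):
        by_id[pid].append(idx)
    ids = random.sample(list(by_id), p)                 # P distinct identities
    batch = []
    for pid in ids:
        pool = by_id[pid]
        batch += random.choices(pool, k=k) if len(pool) < k else random.sample(pool, k)
    return batch
```

Grouping K images of each of P identities in every batch guarantees the triplet loss always finds valid positive and negative pairs.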
Here we set P = 16 and K = 4, and use the Adam method [21] as the optimizer, with β1 = 0.9 and β2 = 0.999. For better performance, we adapt the learning rate according to a fixed schedule in four stages, using 10 epochs to increase the learning rate linearly from 3.5 × 10^−5 to 3.5 × 10^−4. At the 40th epoch, the learning rate is decreased to 3.5 × 10^−5, and at the 70th epoch to 3.5 × 10^−6. A total of 120 epochs are trained on the Market-1501 and DukeMTMC-ReID data sets, and 200 epochs on the MSMT17 data set. The learning rate lr(t) at epoch t is calculated as:

lr(t) = 3.5 × 10^−4 × t/10 if t ≤ 10;  3.5 × 10^−4 if 10 < t ≤ 40;  3.5 × 10^−5 if 40 < t ≤ 70;  3.5 × 10^−6 if t > 70.

We also use the data augmentation method proposed by Zhong et al. [42], called random erasing augmentation (REA). In our experiments, we set the hyperparameters to an erasing probability of p = 0.5, a random-erasing area ratio of 0.02 < Se < 0.4, and a randomly initialized aspect ratio of the erased area between 0.3 and 3.33.
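The four-stage schedule can be written as a small function. The piecewise form below is our reading of the description (linear warmup to 3.5 × 10^−4 over the first 10 epochs, then step decays at epochs 40 and 70):

```python
def plan_lr(epoch):
    """Warmup + step-decay learning-rate schedule sketch (epoch is 1-based)."""
    if epoch <= 10:
        return 3.5e-4 * epoch / 10   # linear warmup: 3.5e-5 at epoch 1 -> 3.5e-4 at epoch 10
    if epoch <= 40:
        return 3.5e-4                # plateau after warmup
    if epoch <= 70:
        return 3.5e-5                # first decay at epoch 40
    return 3.5e-6                    # second decay at epoch 70
```

In practice this could be attached to an optimizer via `torch.optim.lr_scheduler.LambdaLR` by dividing out the base learning rate.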

Ablation study
As shown in Tables 2, 3 and 4, we conducted ablation experiments against the Baseline [15] to verify the effectiveness of the attention mechanism, the part-level branch, the global pooling layer and the maximum pooling layer. 1. As shown in Table 2 ('+att'), in a single-domain setting we obtain essentially the same result as the Baseline, but in a cross-domain setting our effectiveness is significantly improved. This shows that the pixel-level spatial semantic features extracted by our attention cascade module improve the generalization ability of the model. 2. The part-level branch we designed is similar to PCB [7], but the PCB network uses multiple ID prediction losses, which limits the acquisition of part-level feature information. Our part-level branch first obtains the characteristics of different channels through MP and AP, and then splices the features of each part into a global feature for a single ID prediction loss, so as to obtain more comprehensive feature information about the complete image. As shown in Table 2 ('+part+amp'), our test results on the Market-1501 and DukeMTMC-ReID data sets improve to different degrees in both single-domain and cross-domain settings. The performance and generalization ability of the model are further improved by our single ID prediction loss over multi-channel characteristics. 3. Finally, we turn to the case with both the attention cascade module and the part-level branch added to the backbone network. We first extract the pixel-level spatial semantic features, then refine diverse features through the part-level branch, and train the network together with the global features of the backbone network to achieve person Re-ID. As shown in Tables 2 and 3, excellent performance is achieved on the Market-1501, DukeMTMC-ReID and MSMT17 data sets.
In Table 3, we compare the MSMT17 data set with the Baseline in a single-domain setting: the Rank1 and mAP values increase by 5.2 and 5.8 percentage points, respectively. In cross-domain settings compared with the Baseline, the Rank1 and mAP values of MSMT17 to Market-1501 increase by 3.2 and 2.6 percentage points, respectively, and the Rank1 and mAP values of MSMT17 to DukeMTMC-ReID increase by 10.6 and 7.1 percentage points, respectively. Cross-domain and single-domain test results improve further when we select the Plus branch in the backbone network.
From the above analysis, it can be concluded that the attention cascade module obtains pixel-level spatial location features by calculating the similarity relevance among pixels. The optimized SE module then further expands the feature receptive field and extracts larger and more comprehensive feature information through 1 × 1 convolution in place of global pooling. The part-level branch extracts global and local features by evenly splitting the backbone-network features; through max pooling and average pooling of the local features, the feature information of different channels can be obtained. Therefore, by adding attention cascade modules and part-level branches, the ResNet network can learn fine, diversified features during training. With random erasing removed ('-REA'), model test accuracy drops in the single-domain setting but improves greatly in the cross-domain setting, indicating that although REA data augmentation helps in single-domain tests, removing it lets the model learn more transferable information and perform better in cross-domain tests.
It is noted that different model transfer directions yield different performance in cross-domain testing. For example, the accuracy of testing with Market-1501 (M) as the source domain and DukeMTMC-ReID (D) as the target domain differs from that of testing with D as the source domain and M as the target domain; the accuracy of the D→M test is often higher than that of the M→D test. This phenomenon is inextricably linked to the scale and quality of the data sets: the scales of the Market-1501, DukeMTMC-ReID and MSMT17 data sets increase in that order, the newer data sets are more up-to-date, and the quality of their images is also better. A model trained on a larger-scale data set with higher image resolution and superior quality has stronger generalization and can better adapt to different scenes. Therefore, in training person Re-ID models, as elsewhere in computer vision, it is particularly important to have a large-scale, high-quality data set from which discriminative features can be learned.

Comparison with state-of-the-art methods
We compare our work in Tables 5-7 with recently reported methods on person Re-ID tasks (such as those from CVPR19 and ICCV19); all the results presented are obtained without any re-ranking. In the tables, bold indicates the best and second-best results. In Table 5, using the Market-1501 and DukeMTMC-ReID data sets in a single-domain setting, our method is compared with various recent methods, such as DDAF-BoT [31], VMRFANet, Pyramid [37], BDB [38], OSNet [48], ABD-Net [49], SNR [52], HOReID [53] and AdaptiveReID [54]. In our single-domain test on Market-1501, we use the Plus branch for the backbone network, which gives essentially our best result. In addition, compared with our improved Baseline ('PLAN'), our model tests ('PLAN (Plus)') show that the Rank1 and mAP values increase by 1.5 and 2.4 percentage points, respectively. The Rank1 and mAP values obtained over the Baseline on the DukeMTMC-ReID data set improve by 3.4 and 5.6 percentage points, respectively, exhibiting the best performance. In Table 6, Market-1501 and DukeMTMC-ReID are used as source and target domains, respectively, for cross-domain tests. In this case, we do not conduct any operation on the target domain and directly compare the test results with general methods such as ATNet [10], PTGAN [11], SPGAN [12], SPGAN+LMP [12], TJ-AIDL [16], HHL [18] and PUL [55]. We achieve the best results both with and without the REA method. Since MSMT17 is a very recent data set and few studies have used it for cross-domain experiments, we compare only with the Baseline: in Table 4, we use MSMT17 as the source domain and Market-1501 and DukeMTMC-ReID as target domains for cross-domain testing, again achieving the best results compared with the Baseline. At the same time, our method achieves the best results on the MSMT17 data set in the single-domain setting, compared with Auto-ReID [36], IANet [50], AdaptiveReID [54], PDC [57] and GLAD [58], as shown in Table 7.

CONCLUSIONS
This paper proposes a solution, PLAN, to cross-domain person Re-ID problems that uses no information from the target domain to train the model. We improve the generalization and adaptability of the model by extracting discriminative and robust person features, thereby addressing the degradation of pedestrian recognition across domains. We use a combination of an attention cascade module and a part-level branch to learn discriminative features.
Experiments showed that PLAN achieves the best cross-domain performance, with better recognition ability and robustness, on three Re-ID data sets.