1. Introduction
The Re-ID task aims to search for the most likely images belonging to the same pedestrian in the gallery (the candidate image set). Common challenges include background interference, viewpoint change, light intensity, body posture change and occlusion [1].
Recently, algorithms based on deep learning [2,3,4,5,6,7,8] have made significant progress in the Re-ID field. One development trend of these algorithms is using attention mechanisms to capture global and local features. Specifically, global features can directly represent changes in the appearance and spatial position of an image. Yang et al. [4] discovered that combining local and global features captures more semantic information. However, global features alone also contain interference information (the background, etc.) because they lack local detail.
To solve this problem, Chen et al. [5] proposed the mixed high-order attention network (MHN) to model and exploit the complex high-order statistical information in the attention mechanism. Chen et al. [6] proposed the attentive but diverse network (ABD-Net), which combines attention modules as a complementary mechanism.
The problems of occlusion, similar appearance and size changes make retrieval extremely challenging. In Figure 1a, most of the labeled pedestrian's body is covered. In Figure 1b,c, the pedestrians to be retrieved are highly similar to nearby people. The target pedestrian in Figure 1d is interfered with by background information. Luo et al. [9] proposed a strong baseline with six tricks (warmup learning rate, random erasing augmentation, label smoothing, last stride, batch normalization neck (BNNeck) and center loss). Woo et al. [10] enlarged the receptive field with a larger convolution kernel.
Luo et al. [11] observed that a model's effective receptive field is only a small part of its theoretical receptive field. We designed DBA-Net to capture appearance features of various dimensions, integrating attention modules into the network together with a dual branch strategy.
In particular, our CPSA block integrates global semantic information into local features and focuses on local information between feature maps. This module comprises a channel attention block (CAB), a position attention block (PAB) and a spatial attention block (SAB). We observed that CAB focuses on the feature differences between channels, PAB learns the dependencies between feature maps through dimensional transformation, and SAB integrates the global appearance features and fine details across the channel, position and spatial dimensions.
Loss functions play an essential role in optimizing and updating model parameters during backpropagation. Generally, researchers [1,6,10,12] use the cross-entropy loss (ID loss) together with metric losses to optimize their models. ID loss can correctly assign pedestrians to their classes. Meanwhile, the core of the Re-ID task is retrieving the images most similar to the target pedestrian from the gallery. We therefore introduce metric learning, which can autonomously learn a task-specific distance metric: by calculating the similarity between two images, it assigns the input image to the category with the highest similarity.
Some researchers [6,13,14] chose the triplet loss function and achieved good results; in this paper, we also use triplet loss to enhance ranking performance. Wang et al. [15] proposed the ranked list loss (RLL). Since hard negative samples can be very similar to the input image, the network is likely to classify them as positives. Gong et al. [16] designed WRLL to address this problem. In this article, we use ID loss, triplet loss and WRLL to optimize DBA-Net.
In summary, our work has the following contributions:
We design a residual attention block (CPSA) to extract channel, position and spatial features in different dimensions. It can capture key fine-grained information, such as bags and shoes. We observe that these three blocks are complementary and can capture the vital information of the input image adaptively.
We notice that a network with multiple branches learns robust features more effectively than a single-branch one. Specifically, we not only use the global network branch but also introduce another branch after the fourth layer of ResNet-50. Each branch utilizes generalized mean pooling (GeM) [17]. The dual branch strategy is significant for fusing global and local vital information.
Based on ID loss, we use complementary triplet loss and WRL loss to optimize DBA-Net. Each branch employs the same loss function. Triplet loss and WRL loss can enhance the intra-class compactness and inter-class separability in the Euclidean space.
A large number of experiments on Market1501 [18], DukeMTMC-ReID [19] and CUHK03 [2] proved that DBA-Net is superior to state-of-the-art networks. In particular, on the CUHK03 dataset, DBA-Net achieved an mAP of 83.2%.
The paper is organized as follows: In Section 2, the relevant background to our research is introduced. In Section 3, we present our algorithm in detail. In Section 4, we show the comparative and ablation experiments on three popular open-source datasets. In Section 5, we visualize and analyze the proposed model. Our overall conclusions are then presented in Section 6.
2. Related Work
In this section, we present some outstanding methods related to our network and give a brief introduction to them.
2.1. Attention Mechanisms in Person Re-ID
A host of classic methods can train effective discriminative models [4,20,21,22,23]. Pyramid [24], PCB [25] and MGN [20] achieved advanced performance by integrating global characteristics with local features based on different stripes. Zheng et al. [24] noticed that both the global and the local information of an image are vital and need to be used together. Sun et al. [25] combined soft segmentation (the attention of a neural network) with hard segmentation (direct preprocessing and partitioning), which was better than algorithms using only one of the two. Wang et al. [20] directly divided the input image into several horizontal strips, emphasizing that semantic information is not obtained only from semantically meaningful regions.
Simultaneously, there are problems regarding information redundancy and over-fitting due to the complexity of models. Global features are responsible for the extraction of overall features, and each block of image segmentation is responsible for the extraction of different levels of features. Therefore, there is redundant information between global features and local features.
To diminish the redundancy of a Re-ID network, some researchers [5,8,9,23,26,27,28] focused on models with attention blocks. Tai et al. [9] proposed the attribute attention network (AANet), which integrates information from pedestrians' critical parts into the network. Zhang et al. [27] presented second-order non-local attention (SONA), which connects local information through second-order features. Li et al. [8] proposed harmonious attention (HA), which introduced a cross-attention interactive learning mechanism to further optimize attention selection and feature learning. Lin et al. [28] designed self-critical attention (SCAL), which guides model training by evaluating the attention map's quality and providing strong supervision signals.
However, it is worth noting that these models perform convolution within a limited receptive field and cannot effectively capture global context information. To address this, some researchers [29,30] proposed self-attention models. Si et al. [7] proposed a dual attention matching network based on inter-category and intra-category attention to capture information from Re-ID video sequences. Dual attention [31] introduced a self-attention mechanism to capture features along the spatial and channel dimensions. CBAM [32] is a light and effective network that collects local detail information and combines it with global critical information, making the image features more representative.
The first aspect we improve is the model's attention mechanism. After extensive research on attention algorithms, we found that the position attention module is also effective, and we observed experimentally that cascaded modules outperform parallel ones. Therefore, we connect the channel, position and spatial attention modules in series and determine the best ordering through experiments, yielding CPS-Attention.
In our opinion, features such as bags and shoes also fall within the convolution window; however, standard non-attention methods pay more attention to the physical features of pedestrians. This is supported by the visualization in Section 5. After comparing the heat map results with those of other algorithms, we found that CPSA paid more attention to the features of people's clothes, bags and shoes, while other algorithms largely ignored them. Therefore, in our opinion, people's clothes, bags and shoes are inconspicuous local features.
CPS-Attention is a lightweight attention mechanism implemented in tandem. It extracts multi-level features, including channel, position and spatial features at different depths. In particular, it can sensitively capture essential details, such as bags and shoes. We found that these three modules were complementary and enabled the network to capture vital information adaptively.
2.2. Loss Function for DBA-Net
In addition, the loss function undoubtedly plays a vital role in model optimization; a good loss function can greatly improve network performance. Loss functions commonly used in Re-ID tasks can be divided into classification loss (ID loss) and metric losses. For the former, Zheng et al. [33] proposed the ID-discriminative embedding (IDE) to train the model, fine-tuned from the ImageNet [34] pre-trained model. Since IDE is trained by a classification loss, it is also called the ID loss of pedestrian Re-ID.
However, models trained only with ID loss perform inadequately. Therefore, several works [14,28,35,36] combined ID loss and triplet loss to optimize the model and achieved good results. However, they concentrated only on positive samples and ignored the samples' internal structure. Wang et al. [15] argued that structural information between samples helps optimize the model better. Gong et al. [16] proposed WRLL, which adapts better to different datasets.
Furthermore, we noticed that appropriately increasing the network's output branches makes the trained model more robust. Most researchers build on the global branch of ResNet-50, while some have proposed multi-branch strategies. We therefore combine a multi-branch strategy with the attention mechanism: we tried different positions and numbers of branches and designed DBA-Net.
Specifically, we used a dual branch strategy and passed each branch through GeM [17]. The two branches optimize features of different depths in parallel. In this article, we combined ID loss, triplet loss and WRL loss to optimize DBA-Net, using the same loss functions on each branch, and found these three loss functions to be complementary. Section 4 confirms that DBA-Net has excellent robustness.
3. Dual Branch Attention Network
At present, some algorithms cannot capture detailed features, such as knapsacks and shoes. Simultaneously, models that use only a global branch recognize the input pedestrian image poorly. We therefore designed DBA-Net. We first show the overall structure of DBA-Net and then analyze the CPSA module. Finally, we introduce the dual branch strategy and loss functions.
3.1. Architecture of DBA-Net
In Re-ID tasks, ResNet-50 [8,13,17,20,37,38,39,40] is the model commonly used, as it is simple and effective. For the convenience of comparison with other algorithms, we still use ResNet-50 as the DBA-Net's backbone network.
Figure 2 shows the network's overall framework. It comprises four parts: ResNet-50, CPS-Attention, GeM and Triplet-WRLL. We embed the CPSA module behind the first and fourth layers.
The CPSA module consists of three small blocks (CAB, PAB and SAB). It is a lightweight attention mechanism implemented in tandem that effectively captures discriminative features at different levels; we introduce the specific structures below. To make the model capture detailed features, such as shoes and bags, we not only use the global network branch but also introduce another branch after the fourth layer. Each branch is processed by a GeM layer. Based on ID loss, we also use triplet loss and WRLL to optimize the two branches. Specifically, we place Triplet-WRLL behind the GeM layer and apply ID loss after the fully connected layer.
3.2. CPSA Module
Inspired by [6,14], we designed CPSA to extract abundant, multi-level features along the channel, position and spatial dimensions. We found that the three blocks were complementary: the model can capture essential details such as a pedestrian's backpack and shoes. In other words, CPSA enhances the model's ability to represent pedestrian characteristics.
Channel-Wise Attention Block: Figure 3 presents the architecture of the CAB block. CAB effectively perceives the relationships between channels and models the interdependence between convolutional feature channels; the input channels are compressed during convolution. We use average pooling (to capture the overall extent) and maximum pooling (to capture the distinctive part) in parallel to compress the spatial dimensions of the feature maps. They produce two different one-dimensional context descriptors, F^c_avg and F^c_max, which are integrated through the attention mechanism [41] to obtain the channel attention map M_c:

M_c = σ(W_1 δ(W_0 F^c_avg) + W_1 δ(W_0 F^c_max))

where W_0 and W_1 are parameters of the shared FC layers, and σ and δ represent the Sigmoid and ReLU functions, respectively. The obtained M_c is then combined with the original feature map through channel-wise multiplication. CAB can enhance the more informative channels while suppressing the less useful channels at the same time.
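To make the computation above concrete, the following is a minimal NumPy sketch of this channel attention step. It assumes CBAM-style shared FC weights `w0`/`w1`; the toy sizes and the reduction ratio `r` are illustrative, not the values used in DBA-Net:

```python
import numpy as np

def channel_attention(feat, w0, w1):
    """Channel attention sketch: squeeze spatial dims by average/max
    pooling, pass both descriptors through a shared two-layer MLP,
    and fuse with a sigmoid. feat: (C, H, W); w0: (C//r, C); w1: (C, C//r)."""
    avg = feat.mean(axis=(1, 2))                  # F_avg^c descriptor
    mx = feat.max(axis=(1, 2))                    # F_max^c descriptor
    relu = lambda z: np.maximum(z, 0.0)           # delta
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))  # sigma
    return sigmoid(w1 @ relu(w0 @ avg) + w1 @ relu(w0 @ mx))

rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2                           # toy sizes
feat = rng.standard_normal((C, H, W))
w0 = rng.standard_normal((C // r, C)) * 0.1
w1 = rng.standard_normal((C, C // r)) * 0.1
m_c = channel_attention(feat, w0, w1)
refined = feat * m_c[:, None, None]               # channel-wise multiplication
```

Each channel of `feat` is rescaled by a weight in (0, 1), so informative channels are enhanced and less useful ones suppressed.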
Position-Wise Attention Block: Figure 4 presents the position-wise attention framework. PAB represents the positional relationship between the pixels in the input feature maps and aggregates semantically related pixels. First, the input feature maps are preprocessed (reshaped and projected) into new feature maps. Then, we multiply these feature maps and compute, for each column, the mean (average value) and max (maximum value). Finally, we obtain the position-wise attention map by combining these statistics with the softmax function, where each entry of the attention map represents the influence of the i-th channel on the j-th channel; the output is then adjusted back to the input size. We observed that fine-grained features are ignored when the maximum function is used alone; therefore, we also use the average function in parallel.
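As an illustration, here is a minimal sketch of the pairwise-pixel aggregation that underlies position attention, in the style of the dual attention network [31]. The mean/max statistics and the projection convolutions of our PAB are omitted for brevity, so this is a reading of the mechanism rather than the exact module:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def position_attention(feat):
    """Compare every pixel with every other pixel and aggregate
    semantically related positions. feat: (C, H, W)."""
    C, H, W = feat.shape
    x = feat.reshape(C, H * W)        # each column is one pixel position
    energy = x.T @ x                  # (N, N) pairwise affinities
    attn = softmax(energy, axis=-1)   # row i: how much pixel i attends to others
    out = (x @ attn.T).reshape(C, H, W)
    return out, attn

rng = np.random.default_rng(1)
feat = rng.standard_normal((4, 3, 3))
out, attn = position_attention(feat)
```

Because each row of `attn` is softmax-normalized, every output pixel is a convex combination of all input pixels, which is what lets PAB link semantically related but spatially distant positions.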
Spatial-Wise Attention Block: Figure 5 presents the spatial-wise attention framework. SAB complements CAB and PAB by capturing more semantic information in the spatial dimension. To obtain feature maps in the spatial dimension, the input feature maps pass through two pooling layers (maximum pooling and average pooling) to obtain strong feature maps of different spaces. The feature maps from the different spaces are then aggregated to obtain the spatial attention map. SAB strengthens the input features and enhances the consistency of spatial correlation estimation.
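The following sketch shows one plausible realization of this spatial attention map, assuming a CBAM-style fusion of the two pooled maps by a small convolution. The random kernel stands in for a learned one:

```python
import numpy as np

def spatial_attention(feat, kernel):
    """Spatial attention sketch: channel-wise average and max pooling
    give two (H, W) maps; a 'same'-padded convolution fuses them and a
    sigmoid yields the spatial map M_s. feat: (C, H, W); kernel: (2, k, k)."""
    pooled = np.stack([feat.mean(axis=0), feat.max(axis=0)])  # (2, H, W)
    k = kernel.shape[-1]
    p = k // 2
    padded = np.pad(pooled, ((0, 0), (p, p), (p, p)))
    H, W = pooled.shape[1:]
    out = np.empty((H, W))
    for i in range(H):                        # naive sliding-window fusion
        for j in range(W):
            out[i, j] = np.sum(kernel * padded[:, i:i + k, j:j + k])
    return 1.0 / (1.0 + np.exp(-out))         # sigmoid

rng = np.random.default_rng(2)
feat = rng.standard_normal((6, 5, 5))
m_s = spatial_attention(feat, rng.standard_normal((2, 3, 3)) * 0.5)
refined = feat * m_s[None, :, :]              # broadcast over channels
```

The map `m_s` rescales every spatial location identically across channels, so spatially salient regions are emphasized regardless of which channel carries them.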
3.3. Dual Branch with Generalized Mean Pooling
In Re-ID tasks, we need to solve problems of similar appearance and occlusion. The traditional methods pay less attention to sub-salient information. Therefore, we propose a multi-scale attention mechanism network with dual branches to coordinate the salient and sub-salient information. Specifically, we not only use the global network branch but also introduce another branch after the fourth layer. Each branch is processed by the GeM layer.
The original ResNet-50 framework has only a global branch; the difference in our design is that we introduce a second branch after the fourth layer. The two branches use the same loss functions to train the network in parallel, and we use the features of the global branch at test time. The dual branch strategy better optimizes the model parameters, yielding a more robust deep model. Experiments on three public datasets showed that the model trained with this dual branch strategy was robust; we present the results of DBA-Net in the later experiments.
The max pooling layer and average pooling layer can eliminate redundant information and retain the main features; however, they cannot capture the features of specific areas. Therefore, we use an adaptive pooling layer on the two branches, namely generalized mean pooling (GeM) [17]:

f = ( (1/|X|) Σ_{x∈X} x^p )^(1/p)

where X is the set of activations in a feature map and p is a hyperparameter that can be adjusted when we train the model. When p→∞, GeM behaves like maximum pooling; when p→1, it behaves like average pooling.
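A one-function NumPy sketch of GeM, demonstrating the limiting behavior stated above; the fixed p values are illustrative:

```python
import numpy as np

def gem_pool(feat, p):
    """Generalized mean pooling over the spatial dimensions.
    feat: (C, H, W) with positive activations (e.g., post-ReLU)."""
    x = feat.reshape(feat.shape[0], -1)
    return (x ** p).mean(axis=1) ** (1.0 / p)

rng = np.random.default_rng(3)
feat = rng.random((4, 4, 4)) + 0.1          # strictly positive activations
g_avg = gem_pool(feat, 1.0)                 # p -> 1: average pooling
g_mid = gem_pool(feat, 4.0)                 # intermediate behavior
g_max_like = gem_pool(feat, 64.0)           # large p: approaches max pooling
```

By the power-mean inequality, the pooled value increases monotonically with p and is bounded between the spatial average and the spatial maximum, which is exactly the interpolation GeM exploits.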
3.4. Loss Function
In this section, we mainly introduce the adaptive weighted ranked list loss (WRLL). In addition to ID loss, we also employ the triplet loss function. Their combined use increases the network's ability to distinguish similar features.
We first present the symbols and their meanings. X = {(x_i, y_i)} is the training set, where x_i and y_i are the sample and the label with index i. The training set contains c classes in total; for an anchor image x_i^c, P_{c,i} represents its positive samples and N_{c,i} its negative samples. Given an image x_i^c, our task is to pull the positive sample points closer, inside a hypersphere with radius α − m. We train all positive sample points together with the following formula:

L_P(x_i^c) = (1/|P_{c,i}|) Σ_{x_j^c ∈ P_{c,i}} [d_ij − (α − m)]_+

where d_ij represents the cosine distance between one sample and another and [·]_+ = max(·, 0). To separate the negative sample points from the input as much as possible, we keep a distance of more than m between the positive and negative sample points. We thus have two hyperparameters (α, m).
The number of positive samples in each batch is smaller than that of the negative samples. To maintain the generalization performance of the network, we use the softmin function to assign weights adaptively:

w_ij = exp(−d_ij) / Σ_{x_k ∈ N_{c,i}} exp(−d_ik)

where d_ij is the Euclidean distance between the input image and a negative sample. This formula performs hard-sample mining with adaptive weighting: when a sample is hard to identify, the adaptive weight dynamically allocates weight according to the distance between the two samples. The original RLL weighting instead uses a slope parameter T to control the change of the weights, and T must be adjusted for each dataset to achieve the best effect; the softmin strategy above is therefore a more convenient and flexible weighting strategy. To balance the distance between the negative sample points and the input, that distance should be above α.
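As an illustration, the following sketch contrasts a slope-based RLL-style weighting with the softmin weighting described above. Both formulas are our reading of [15,16], and the distance and parameter values are illustrative:

```python
import numpy as np

def rll_weights(d_neg, alpha, T):
    """RLL-style weighting sketch: a slope T must be tuned per dataset."""
    w = np.exp(T * (alpha - d_neg))
    return w / w.sum()

def softmin_weights(d_neg):
    """Softmin weighting sketch: closer (harder) negatives get larger
    weights, with no slope parameter to tune."""
    e = np.exp(-(d_neg - d_neg.min()))   # shift for numerical stability
    return e / e.sum()

d_neg = np.array([0.5, 2.0, 1.0])        # distances anchor -> negatives
w_soft = softmin_weights(d_neg)
w_rll = rll_weights(d_neg, alpha=1.8, T=1.0)
```

Both schemes give the hardest negative (smallest distance) the largest weight; the softmin variant simply removes the dataset-specific slope parameter.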
Similar to the above, the negative-sample loss function pushes the negatives beyond the boundary α:

L_N(x_i^c) = Σ_{x_k ∈ N_{c,i}} w_ik [α − d_ik]_+

We optimize both the positive and negative loss functions at the same time. Finally, adding the triplet loss, we obtain the following learning strategy:

L = L_ID + λ_1 L_Tri + λ_2 L_WRLL

where L_ID is the cross-entropy loss function and L_Tri is the triplet loss function; both branches are optimized with the same loss functions. The coefficients λ_1 and λ_2 need to be fine-tuned.
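The following sketch computes a toy version of this combined objective for a single anchor. The margin and coefficients are illustrative, batch-hard mining is omitted, and the ID loss is stubbed out, so this only shows how the metric terms compose:

```python
import numpy as np

def triplet_loss(anchor, pos, neg, margin=0.3):
    """Triplet loss with Euclidean distances (margin is illustrative)."""
    d_ap = np.linalg.norm(anchor - pos)
    d_an = np.linalg.norm(anchor - neg)
    return max(d_ap - d_an + margin, 0.0)

def wrll_terms(anchor, positives, negatives, alpha, m):
    """Positive pull-in and softmin-weighted negative push-out terms."""
    d_pos = np.linalg.norm(positives - anchor, axis=1)
    l_pos = np.maximum(d_pos - (alpha - m), 0.0).mean()
    d_neg = np.linalg.norm(negatives - anchor, axis=1)
    w = np.exp(-d_neg) / np.exp(-d_neg).sum()        # softmin weights
    l_neg = (w * np.maximum(alpha - d_neg, 0.0)).sum()
    return l_pos + l_neg

anchor = np.array([0.0, 0.0])
positives = np.array([[0.1, 0.0], [0.0, 0.2]])       # already inside the sphere
negatives = np.array([[3.0, 0.0], [0.0, 5.0]])       # already beyond alpha
l_tri = triplet_loss(anchor, positives[0], negatives[0])
l_wrl = wrll_terms(anchor, positives, negatives, alpha=1.8, m=1.0)
total = 0.0 + 0.2 * l_tri + 0.5 * l_wrl              # ID loss stubbed to 0
```

For this easy configuration every hinge term is zero; a negative closer than α, or a positive outside the radius α − m, makes the corresponding term positive.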
4. Experiment
The experimental environment is PyTorch 0.4.1, and the server uses Tesla V100 GPUs. We evaluated DBA-Net on three large public datasets: Market-1501 [18], DukeMTMC-ReID [19] and CUHK-03 [2]. First, we compare DBA-Net's performance with the latest methods. Then, we show the relevant hyper-parameter processing and ablation experiments. Finally, we present a visual analysis of the network.
4.1. Dataset Description
Datasets: After investigation and collection, we decided to use the three authoritative datasets mentioned above; Table 1 gives a detailed overview. Specifically, Market-1501 contains 32,668 images of 1501 labeled persons from six camera views; 12,936 images of 751 identities were randomly selected as the training set, and 19,732 images of 750 identities were used as the testing set. As a large-scale dataset, DukeMTMC-ReID has 36,411 images of 1404 identities from eight camera views; 16,522 images of 702 identities were selected as the training set, and 19,889 images of 702 identities were used as the testing set. CUHK-03 is a small-scale re-identification dataset with 14,088 images of 1467 identities.
Evaluation Metrics: We used mAP and the cumulative matching characteristic (CMC) curve, both widely recognized in the Re-ID task, to evaluate the performance of DBA-Net. The abscissa of the CMC curve is Rank-n (n = 1, 3, 5, ...), and the ordinate is the corresponding precision, so the curve directly shows the recognition accuracy at each Rank-n. Rank-n is the probability that the target is correctly retrieved within the top n results. mAP presents the average accuracy of correct retrieval during testing and reflects the overall performance of a model.
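To make the metrics precise, here is a small self-contained sketch of computing CMC and mAP from a query-gallery distance matrix. It is simplified relative to the standard protocol (no camera-ID filtering, and every query is assumed to have at least one gallery match):

```python
import numpy as np

def cmc_map(dist, q_ids, g_ids):
    """dist: (Q, G) distances; q_ids: (Q,) labels; g_ids: (G,) labels.
    Returns (cmc, mAP), where cmc[n-1] is the Rank-n accuracy."""
    Q, G = dist.shape
    cmc = np.zeros(G)
    aps = []
    for q in range(Q):
        order = np.argsort(dist[q])                   # gallery by distance
        matches = (g_ids[order] == q_ids[q]).astype(float)
        cmc[int(matches.argmax()):] += 1              # first correct rank onward
        hits = np.cumsum(matches)
        precision = hits / np.arange(1, G + 1)        # precision at each rank
        aps.append((precision * matches).sum() / matches.sum())
    return cmc / Q, float(np.mean(aps))

# Toy example: 2 queries, 3 gallery images; both queries rank a correct
# match first, so Rank-1 and mAP are both 1.0.
dist = np.array([[0.1, 0.9, 0.5],
                 [0.8, 0.2, 0.7]])
q_ids = np.array([0, 1])
g_ids = np.array([0, 1, 0])
cmc, mAP = cmc_map(dist, q_ids, g_ids)
```

Rank-n reads off `cmc[n-1]`, while mAP averages the per-query average precision over the full ranked list.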
Other details: We used horizontal flipping, random cropping [42] and random erasing [43] to preprocess the images. The input images were resized to 256 × 128. Our backbone network was ResNet-50 pre-trained on ImageNet [34]. The baseline network was BagTricks [10,20] using only the cross-entropy loss function. We used the Adam optimizer to train the models; the initial learning rate (Lr) and the weight decay were both 0.0005. Table 2 presents the learning rate schedule, where t represents the number of epochs.
4.2. Hyper-Parameter Experimental Analysis
In this section, we analyze the parameters of the loss functions. Since there are four parameters, we adopted the following strategy: first, we used only the WRL loss function and determined appropriate values of the hypersphere parameters α and m through comparative experiments; then, we added the triplet loss function and analyzed the coefficients λ1 and λ2; finally, we used the triplet loss and the WRL loss together and analyzed α and m again.
Coefficients λ1 and λ2: With these loss functions, different coefficient combinations produced different experimental results, so the coefficients had to be adjusted properly. Following the parameter adjustment method of GLWR [16], we used the control-variable method with initial values λ1 = 0.5 and λ2 = 0.5. For example, we fixed λ2 = 0.5, changed the value of λ1 and selected the best experimental result through extensive experiments: we first changed λ1 by 0.1 each time to find its approximate value and then by 0.05 each time to obtain its final value. We then fixed λ1 and found the optimal value of λ2. Figure 6 shows the optimized results. From Figure 6, we can see intuitively that λ1 and λ2 had a significant influence on the model. Through extensive experiments, we found that our model performed best when λ1 = 0.2 and λ2 = 0.5.
Hyper-parameters α and m: Furthermore, the parameters α and m in Equation (5) also had a significant influence on the performance of DBA-Net. The size of α, determined by the dataset, represents the radius of the hypersphere and needs to be adjusted properly to ensure good performance of the model. We therefore performed many comparative experiments on these two parameters, again using the control-variable method. Figure 7 presents the optimized results. As shown in Figure 7, α and m had a great influence on DBA-Net. According to a large number of experimental results, DBA-Net performed best with α = 1.8 and m = 1.0 for Market-1501, α = 2.0 and m = 1.2 for DukeMTMC-ReID, and α = 1.7 and m = 1.0 for CUHK-03.
4.3. Comparison with State-of-the-Art Methods
We compared DBA-Net with the state-of-the-art (SOTA) approaches: AGW [1], CAM [4], ABD-Net [6], BagTricks [9], AANet [10], RAG-SC [12], SGSC [14], GRLL [16], MGN [20], GD-Net [21], IANet [23], Pyramid [24], SONA [27], SCAL [28], Auto-ReID+ [44] and Ms-Mb [45]. For the sake of fairness, we did not adopt the re-ranking trick, and ResNet-50 [42] is the backbone for all methods except Pyramid [24], whose backbone is ResNet-101.
Results on Market-1501: We show these results in Table 3. Among these methods, it should be pointed out that Pyramid [24] uses a more powerful backbone and complex stripe features and is representative of global feature algorithms. SGCS [16] uses a cascading strategy to extract different potential features at each stage and integrates them for the final representation. However, global features ignore local details, and the cascade strategy greatly increases the complexity of the network. Overall, our approach was more competitive than the other SOTA approaches.
Results on DukeMTMC-ReID: We show these results in Table 4. On this dataset, DBA-Net also achieved the best Rank-1, 4% higher than SGCS [16]. In addition, despite using a lighter backbone, our results were 4% higher than Pyramid [24].
Results on CUHK-03: We show these results in Table 5. Compared with the two datasets above, CUHK-03 is more challenging, as it has few samples and serious occlusion. However, DBA-Net can capture more detailed features, such as shoes and bags, and accurately retrieve the target pedestrian. DBA-Net surpassed Pyramid [24] by 7.5% in Rank-1 and 8.4% in mAP. In addition, DBA-Net outperformed SGCS [16] and achieved a new SOTA.
We also compared DBA-Net against the algorithms with the best current experimental results. Specifically, Ms-Mb [45] achieved 95.8% top-1 accuracy and 88.9% mAP on Market-1501; SGSN [16] achieved 91.0% top-1 accuracy on DukeMTMC-ReID; and SGSN [16] obtained 84.7% top-1 accuracy and 81.0% mAP on CUHK-03, as shown in Table 3, Table 4 and Table 5. Nevertheless, the experimental results of DBA-Net clearly exceeded these. In particular, on the CUHK03 dataset, DBA-Net surpassed SGSN [16] by 1.7% in Rank-1 and 2.2% in mAP.
Through comprehensive analysis of these experimental results, DBA-Net had the best performance. Its mAP values on the three datasets were 90.3%, 83.1% and 83.2%, higher than the above suboptimal networks, and the model achieved a new SOTA on these three popular public datasets.
4.4. Ablation Experiment
In this section, we demonstrate the effectiveness of CPSA, Triplet-WRLL and DBA-Net through ablation experiments. The network performance declines when the margin parameter of the loss function is large, so we set it to 0.3. Table 6 presents the results of the ablation experiments.
First, we kept the BagTricks [10] model that uses only the cross-entropy loss function as our baseline. Then, we stacked the GeM, Triplet, WRLL and CPSA modules onto the baseline in turn and found that each module effectively improved the performance of the network. As we expected, the network trained with the combination of the three loss functions was robust; in particular, WRL loss performed outstandingly on the CUHK-03 (detected) dataset. In addition, CPS-Attention made the network more discriminative of input images. Finally, DBA-Net had the best performance.
Furthermore, we also plotted the ablation experiments to present them more intuitively, as shown in Figure 8 and Figure 9.
4.5. CPS-Attention
In this section, we prove the effectiveness of CPS-Attention through comparative experiments. Specifically, we added other advanced attention modules to the same benchmark network for comparison, including Dual attention [31], CBAM [32] and GL-attention [16]. To be fair, these experiments were all conducted on DBA-Net. Figure 10, Figure 11 and Figure 12 show the comparative experimental results of the attention models.
CPS-Attention combines three modules and focuses on local features, such as a pedestrian's backpack and shoes; in other words, CPSA enhances the representation ability of the model. On Market-1501, CPS-Attention surpassed the other attention blocks in both mAP and Rank-1, and on DukeMTMC-ReID and CUHK-03 its superiority was even clearer. Overall, CPSA brought remarkable improvement to DBA-Net.
6. Conclusions
In this paper, we proposed a novel attention network (DBA-Net). We embedded a powerful attention module (CPSA) into the network and used complementary loss functions (Triplet-WRLL) on its two branches. Extensive experiments proved that DBA-Net achieves advanced performance. In addition, we observe that single-domain Re-ID research has reached a bottleneck, making further gains in evaluation metrics difficult. Some researchers are now conducting cross-domain work and have achieved certain results, but cross-domain Re-ID still offers a large research space and is a new research trend. In the future, we intend to apply DBA-Net to the cross-domain Re-ID task.