1. Introduction
In recent years, public safety has become increasingly important, and video surveillance plays a crucial role in security. Surveillance cameras produce a large number of video files every day, and relying on personnel to review them manually is inefficient, which makes it challenging to locate and track target pedestrians. Meanwhile, pedestrian images captured by surveillance cameras have a low resolution, so face recognition cannot reliably determine identity, and the overall appearance must be used to locate the target pedestrian. Thus, person re-identification has become an important research objective in video surveillance. Person re-identification is an image retrieval technology that recognizes specific pedestrians across cameras. It extracts features from a given pedestrian image and ranks candidate images by similarity to retrieve all images of that pedestrian captured by other devices. Person re-identification can, to a certain extent, overcome the visual limitations of a single fixed camera. It can be combined with other computer vision technologies, such as pedestrian detection, face recognition, and crowd density estimation, to form a pedestrian image analysis system with critical applications in public security. In actual scenes, problems such as occlusion, illumination change, and camera angle difference [1, 2, 3] cause pedestrian posture to vary across images, which leads to insufficient feature extraction and makes it difficult to measure similarity later. All these problems lower the accuracy of the results. Therefore, obtaining more comprehensive pedestrian features is a major difficulty in person re-identification research.
For person re-identification, the main task is to extract a more discriminative feature representation from pedestrian images. According to the feature extraction strategy, person re-identification models can be divided into classification and verification models. Generally, the classification model takes the instance loss [4] as the loss function. The verification model takes two images as input at a time, uses a Siamese neural network to extract feature representations [5] and fuse them, and then calculates a binary loss. According to the feature extraction method, existing methods can also be divided into global and local feature representation learning. Global feature representation learning extracts one global feature representation for each pedestrian image. Since early studies regarded person re-identification as an image classification problem, most early methods used global feature representation learning. However, in actual scenes, the pedestrian image captured by the camera is usually incomplete, and noisy regions in the image interfere significantly with the global features. Meanwhile, because pedestrian posture changes, inconsistent postures in the image frames detected by multiple cameras also prevent global features from being matched, so local features have been widely applied [6, 7, 8]. Global features capture the overall information, while local features capture image details. Combining the two can therefore achieve better results.
In convolutional neural networks, deep layers respond mainly to semantic features, while shallow layers respond mainly to image features. However, deep features have less spatial information, which is not conducive to feature acquisition. Meanwhile, shallow features contain fewer semantic features, which is not conducive to image classification. Therefore, combining deep and shallow features can prevent the loss of crucial information in the image [9, 10] and better meet the needs of person re-identification.
In practical applications, the CNN operator is limited by its local receptive field. To obtain global information, many layers must be stacked, but information gradually fades as the number of layers increases. To solve this problem, researchers have applied the Transformer to the field of computer vision [11, 12]. The Transformer contains multi-head self-attention mechanisms, which can effectively obtain global information. Furthermore, multi-head self-attention replaces operations such as convolution and pooling, so it is less prone to losing image information and can enhance the model’s expressive ability. However, because of the low computational efficiency of the Transformer module, it cannot completely replace the CNN. Images and videos carry more information than text, so the computing cost of image processing with the Transformer is still huge.
Aiming at the problems existing in the field of person re-identification, this paper proposed a new algorithm with the main contributions as follows:
- We proposed a person re-identification network based on a dual-branch structure, called MSMG-Net. The multi-scale branch made features from different layers complementary. It retained semantic information while avoiding the loss of details. The multi-granularity branch obtained output features of varying granularity. It took into account both global and local information, which increased the diversity of features.
- In the multi-scale branch, the bidirectional cross-feature pyramid network achieved feature fusion. It obtained the shallow position supervision feature and the deep semantic supervision feature. Meanwhile, it combined the dual attention mechanism of position and channel to make the network pay more attention to the identity characteristics of pedestrians.
- In the multi-granularity branch, we combined CNN and Transformer to capture long-distance dependencies between image pixels and obtained global and local features, the latter by horizontally splitting the feature map. Meanwhile, we improved the multi-head self-attention mechanism in the Transformer by adding relative position coding in the width and height directions. In this way, the Transformer structure can obtain more accurate relative position information and improve feature expression ability while capturing long-distance dependence.
The algorithm in this article proposed a new idea, which adopted a dual-branch structure and combined multi-scale and multi-granularity branches. In the multi-scale branch, we proposed a bidirectional cross-pyramid structure that combined the attention mechanism to achieve the fusion of multi-scale features. In the multi-granularity branch, CNN and Transformer were combined to improve the multi head self-attention mechanism, adding relative position encoding on the width and height dimensions, and we used horizontal segmentation to obtain local features. We considered more complete feature categories and achieved more effective feature extraction, achieving higher accuracy.
The chapters of this article are arranged as follows:
- Section 1 introduces the research background and value of person re-identification algorithms, then introduces the main ideas and technical routes, and finally summarizes the research content and innovation points.
- Section 2 introduces person re-identification algorithms based on traditional and deep learning algorithms, then introduces the ideas of classic network applications. Finally, it analyzes the shortcomings of existing algorithms and research directions that can be improved.
- Section 3 provides the main structure of a dual-branch person re-identification algorithm based on multiple feature representations. The network structure includes two parts: multi-scale and multi-granularity branches. The multi-scale branch compensates for the loss of feature information caused by using only high-level semantic features. The multi-granularity branch makes the extracted pedestrian information more comprehensive.
- Section 4 introduces person re-identification datasets, algorithm evaluation indicators, and experimental configuration. Then, it presents the ablation and comparative experiments and analyzes the results. Finally, the experimental results are visually displayed to verify the algorithm’s performance.
- Section 5 summarizes the paper’s main research content and innovations. Then it analyzes the shortcomings of the algorithm and future research directions.
3. Proposed Method
The algorithm proposed in this paper took ResNet50 as the backbone network and adopted a dual-branch structure: multi-scale and multi-granularity branches. The specific network structure is shown in Figure 1. In the multi-scale branch, we used the bidirectional cross-feature pyramid network to achieve feature fusion and obtained the multi-scale features of pedestrians. In the multi-granularity branch, we obtained the local features through horizontal splitting of the feature map. Meanwhile, we combined CNN and Transformer modules to establish connections between long-distance pixels and obtain global features using the improved multi-head self-attention mechanism. Finally, the cross-entropy loss and triplet loss were combined to train the network.
3.1. Multi-Scale Branch
The high-level features of the convolutional neural network contain the richest semantic information but lose detailed features of the original image during forward propagation. Shallow features contain rich image details but introduce more noise interference, affecting the discriminability of features. The middle-layer features can balance the two and compensate for the insufficient feature representation caused by using only the semantic features of a single layer [9]. Therefore, this paper proposed the multi-scale branch and adopted the bidirectional cross-feature pyramid network to fuse the features of different layers and avoid the loss of feature information.
3.1.1. Bidirectional Cross Feature Pyramid Network
Inspired by the Bidirectional Feature Pyramid Network (BiFPN) [20], the bidirectional cross-feature pyramid network proposed in this paper added bidirectional cross-scale connections to the original Feature Pyramid Network (FPN). Unlike FPN, the bidirectional cross-feature pyramid network aimed to integrate features of different scales into the final match, so only the fused features of the deeper side were selected as the output, and an attention module was introduced to improve the network performance. The shallow features contain too little semantic information, so we applied the output features of Res3, Res4, and Res5 for feature fusion. The output features of Res3 and Res4 were fused to obtain the shallow position supervision feature, and the output features of Res4 and Res5 were fused to obtain the deep semantic supervision feature. The specific network structure is shown in Figure 2.
Firstly, the shallower feature and the deeper feature were obtained, and their feature dimension was converted to 256. They had different resolutions, so it was necessary to realize a cross-scale connection between the two layers. We adopted global max pooling to downsample and nearest-neighbor interpolation to upsample for feature fusion. Meanwhile, as shown by the curved arrow in the figure, the original feature map was added to the output in each layer of FPN, providing additional discriminative information for the output feature. Because different feature maps contribute differently to the output in the feature fusion process, this paper used the fast normalized fusion method to add a weight to each feature input. The calculation formulas are shown in Equations (1) and (2).
where the five weights are the additional weights for each feature input, and a small parameter ε is set to ensure numerical stability.
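As an illustration of weighted feature fusion, the following is a minimal sketch of fast normalized fusion in the form used by BiFPN, assuming this paper follows the same formulation. The function name, the list-based feature maps, and the default value of `eps` are assumptions made for this example and are not taken from the paper.

```python
def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse same-shaped 2-D feature maps with normalized non-negative weights.

    features: list of H x W nested lists, all with the same shape.
    weights:  one raw scalar weight per feature map.
    eps:      small stabilizing constant (the default here is an assumption).
    """
    # ReLU keeps every weight non-negative, as in the BiFPN formulation.
    clipped = [max(0.0, w) for w in weights]
    denom = eps + sum(clipped)
    rows, cols = len(features[0]), len(features[0][0])
    fused = [[0.0] * cols for _ in range(rows)]
    for w, feat in zip(clipped, features):
        for r in range(rows):
            for c in range(cols):
                # Each input contributes in proportion to its normalized weight.
                fused[r][c] += (w / denom) * feat[r][c]
    return fused
```

With equal weights and `eps=0`, the fused map is the plain average of the inputs; an input whose raw weight is negative is ignored after the ReLU.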
Finally, the bidirectional cross-feature pyramid network took the features of the deeper side as the output and restored their dimension to that of the deeper input. The shallow position supervision feature and the deep semantic supervision feature obtained from the multi-scale branch then worked together to make up for the insufficient feature representation caused by using only a single-layer feature.
3.1.2. Dual Attention Mechanism
In the actual scene, pedestrians may be blocked by various obstructions, so the network should pay more attention to the characteristics of the pedestrian’s body parts. The attention mechanism can use the correlation between features to help the model focus on more relevant features and reduce the ambiguity of features in the final task. In person re-identification, the attention mechanism assigns high weights to areas highly related to pedestrian information and low weights to irrelevant information, which helps distinguish pedestrians with different identities. Based on this, we added the attention mechanism to the multi-scale branch. Before the feature is input into the bidirectional cross-feature pyramid network for feature fusion, the information is reweighted through the attention mechanism. Inspired by Fu et al. [21], we used the dual attention mechanism, as shown in Figure 3. The blue part is the Position Attention Module (PAM), and the orange part is the Channel Attention Module (CAM). An effective combination of them can solve the problem of local feature dependency and spatial mismatch of the image.
In the position attention module, the input feature was fed into three branches, each of which produced a new feature map through a convolution operation. The output feature map of the first branch was transposed and multiplied by the feature map of the second branch, and the result was passed through the softmax layer to obtain the spatial attention map. The feature map of the third branch was then multiplied by the transpose of the spatial attention map, and the output was reshaped to the size of the input feature map. Finally, we multiplied it by a learnable scale parameter and added the result to the input feature map F, as shown in Equation (3). The learnable parameter is initialized to 0 and updated gradually.
The channel attention module took the original feature map directly as input, multiplied it by its transpose, and obtained the channel attention map through the softmax layer. Finally, it multiplied the transpose of the input feature map by the attention map, and the result was multiplied by a learnable scale parameter, as shown in Equation (4), where F is the input feature. The learnable parameter is initialized to 0 and updated gradually.
Finally, the outputs of the two modules were fused to obtain the final output feature map. Applying the attention mechanism before the feature maps are fused through the bidirectional cross-feature pyramid network retains more effective information and avoids the interference of irrelevant features.
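The two attention modules can be sketched on a feature map flattened to a C x N matrix (channels by pixels). This is a simplified illustration and not the exact implementation: the three convolution branches of the position module are replaced by the input itself, the residual addition in the channel module and the element-wise sum used for fusion are assumptions borrowed from the dual attention design of Fu et al., and all function names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Apply a numerically stable softmax to every row."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def scale_and_add(out, feat, scale):
    """Return scale * out + feat element-wise (residual connection)."""
    return [[scale * o + f for o, f in zip(o_row, f_row)]
            for o_row, f_row in zip(out, feat)]


def position_attention(feat, alpha=0.0):
    """feat: C x N matrix. The three conv branches are stand-ins (assumption)."""
    q = k = v = feat
    attn = softmax_rows(matmul(transpose(q), k))  # N x N spatial attention map
    out = matmul(v, transpose(attn))              # C x N re-weighted features
    return scale_and_add(out, feat, alpha)


def channel_attention(feat, beta=0.0):
    attn = softmax_rows(matmul(feat, transpose(feat)))  # C x C channel attention map
    out = matmul(attn, feat)                            # C x N re-weighted features
    return scale_and_add(out, feat, beta)


def dual_attention(feat, alpha=0.0, beta=0.0):
    """Fuse both outputs by element-wise sum (fusion rule is an assumption)."""
    p = position_attention(feat, alpha)
    c = channel_attention(feat, beta)
    return [[x + y for x, y in zip(p_row, c_row)] for p_row, c_row in zip(p, c)]
```

Because both scale parameters start at 0, each module initially returns its input unchanged, and the attention terms are blended in only as the parameters are learned.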
3.2. Multi-Granularity Branch
Due to the complex environment in the realistic monitoring scene, some key details may be ignored if only using global features, and the overall perception of global information is missing if only using local features. Therefore, this paper proposed the multi-granularity branch to obtain both global and local features of pedestrians to avoid the loss of detailed features.
3.2.1. Global Feature Branch
The convolutional neural network acquires information through convolution and pooling operations, but these local operations cannot easily capture global, long-range features. Based on this, this paper combined the CNN with a Transformer based on the improved multi-head self-attention mechanism. A Transformer module with a depth of 8 was connected after the CNN structure. It took the output feature of Res4 as input and re-weighted it to obtain the global feature of pedestrians. The Transformer module addressed the problem that different feature regions cannot establish long-distance dependencies on global information, so that the model can identify pedestrian information more accurately during training.
The Transformer structure mainly comprises multi-head self-attention mechanisms and feedforward neural networks. The feedforward neural network consists of linear transformations and the ReLU activation function, which enhances the nonlinear representation ability. The multi-head self-attention mechanism includes multiple self-attention modules. Self-attention is a special case of the attention mechanism in which the sequence attends to itself to extract the semantic dependencies between its parts. Its input is the feature map output by the CNN, and it obtains a weight for each pixel using a nonlinear function. During training, the influence of each pixel in the feature map is determined by its weight, so that the model pays attention to the relationships between features, which improves the model’s generalization ability. The calculation maps a query and a set of key-value pairs to an output: the compatibility of the query with each key determines the weight assigned to the corresponding value. Its calculation formula is shown in Equation (5).
where Q, K, and V are the query, key, and value, respectively, and d_k is the dimension coefficient.
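The self-attention computation can be illustrated with the standard scaled dot-product form, in which the query-key scores are divided by the square root of the key dimension before the softmax. The sketch below assumes that standard form; the function name and the list-based matrices are choices made for this example.

```python
import math


def scaled_dot_product_attention(q, k, v):
    """q: n x d, k: m x d, v: m x d_v nested lists. Returns an n x d_v matrix."""
    d_k = len(k[0])
    output = []
    for q_row in q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d_k)
                  for k_row in k]
        # Numerically stable softmax over the keys.
        mx = max(scores)
        exps = [math.exp(s - mx) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value vectors.
        output.append([sum(w * v_row[c] for w, v_row in zip(weights, v))
                       for c in range(len(v[0]))])
    return output
```

A query that is equally similar to all keys receives uniform weights, so its output is the mean of the value vectors.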
The multi-headed self-attention mechanism uses multiple self-attention modules in parallel. Its calculation formula is shown in Equation (
6).
where
. We set
.
The self-attention layer does not encode the position information of the two-dimensional image, which limits its expressive ability in visual tasks. Therefore, when constructing the multi-head attention layer, two-dimensional relative position coding was added to improve the ability of the multi-head self-attention mechanism to understand two-dimensional images. Its structure is shown in Figure 4. The two-dimensional relative position coding treated the two-dimensional features of the image as a combination of vectors. It constructed vectors H and W to represent the relative positions of the two-dimensional features in the height and width directions, respectively. The constructed coding injects relative position information into each pixel when the two-dimensional features are processed. Each attention head used a pair of trainable two-dimensional relative position codes, whose values depended on the distance between pixels and were updated continuously during network training to enhance the understanding of two-dimensional features.
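One common way to realize two-dimensional relative position coding is to keep one trainable table per direction, indexed by the row offset and the column offset between two pixels, and to add the query's response to the summed codes to the content logits. The sketch below follows that common formulation as an assumption; the exact form used in this paper may differ, and the names `rel_h`, `rel_w`, and `relative_logits_2d` are hypothetical.

```python
def relative_logits_2d(q, k, rel_h, rel_w, height, width):
    """Attention logits with 2-D relative position terms (before softmax).

    q, k:   (height * width) x d matrices, pixels in row-major order.
    rel_h:  (2 * height - 1) x d table of codes for vertical offsets.
    rel_w:  (2 * width - 1) x d table of codes for horizontal offsets.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    n = height * width
    logits = [[0.0] * n for _ in range(n)]
    for i in range(n):
        yi, xi = divmod(i, width)
        for j in range(n):
            yj, xj = divmod(j, width)
            # Shift offsets so that they index the tables from 0.
            code_h = rel_h[yj - yi + height - 1]
            code_w = rel_w[xj - xi + width - 1]
            position = [h + w for h, w in zip(code_h, code_w)]
            # Content term plus relative position term.
            logits[i][j] = dot(q[i], k[j]) + dot(q[i], position)
    return logits
```

Because the codes depend only on the offset between two pixels, the same table entry is shared by every pixel pair at that relative distance.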
Finally, in the testing process, the multi-scale features obtained from the multi-scale branch and the global and local features obtained from the multi-granularity branch were concatenated. After information fusion from the dual-branch structure, we obtained a more complete and rich pedestrian feature representation.
3.2.2. Local Feature Branch
To retain more details, the downsampling factor of the Res5 stage was set to 1 in the local feature branch so that the output feature maps of Res4 and Res5 had the same resolution. Then the output feature map of Res5 was split horizontally. The PCB network [6] divided the feature map into six parts from top to bottom. However, excessive segmentation leads to a lack of connection among local features. In addition, if there are too many partitions, the feature scale of each region is too small, which makes it challenging to learn discriminative features from the local area. Therefore, the algorithm in this paper split the feature map horizontally into two parts and into three parts. The output feature map of Res5 was pooled by global max pooling with two different kernels, one for each setting, and the pooled features were then split horizontally to obtain 2048-dimensional feature vectors, whose dimension was reduced to 256. Finally, we obtained the outputs through the fully connected layer.
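The horizontal splitting can be sketched as max pooling over each horizontal stripe of the feature map, which yields one vector per part. This is a minimal illustration that pools each stripe directly, assumed here to be equivalent to pooling first and splitting afterwards; the function name and the nested-list layout are assumptions, and the dimension reduction and fully connected layer are omitted.

```python
def horizontal_stripe_features(feature_map, num_parts):
    """Max-pool each horizontal stripe of a C x H x W feature map.

    Returns num_parts vectors, each with one value per channel.
    """
    height = len(feature_map[0])
    if height % num_parts != 0:
        raise ValueError("height must be divisible by num_parts")
    stripe = height // num_parts
    parts = []
    for p in range(num_parts):
        rows = range(p * stripe, (p + 1) * stripe)
        # One maximum per channel over all pixels of this stripe.
        parts.append([max(max(channel[r]) for r in rows) for channel in feature_map])
    return parts
```

Calling the function with `num_parts=2` and `num_parts=3` on the same feature map gives the two-part and three-part local features, respectively.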
3.3. Loss Function
In the training process, we took the combination of the cross-entropy loss and the triplet loss [22] as the loss function.
We applied the cross-entropy loss to classification learning in the feature learning stage, regarding the training process as a multi-class classification problem, and applied it to the output features of all branches. Firstly, the features were processed by the softmax function to construct a probability distribution and obtain the classification result of pedestrian identity. The calculation formula of the softmax function is shown in Equation (7), where N is the batch size, f is the learned feature, W_j is the weight vector of the fully connected layer corresponding to category j, and K is the number of categories.
Finally, we used the cross-entropy loss as the objective function, which calculated the distance between the actual and expected output. The smaller the value was, the closer the output was to the expected result. Its calculation formula is shown in Equation (
8).
where
N is the number of pedestrian categories,
is the prediction vector, and
is the real label vector.
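For a single sample, the classification objective can be sketched as a softmax over the class scores followed by the negative log-probability of the true identity. The sketch assumes a one-hot label, for which the cross-entropy reduces to that single term; the function names are chosen for this example.

```python
import math


def softmax(logits):
    """Turn class scores into a probability distribution (numerically stable)."""
    mx = max(logits)
    exps = [math.exp(z - mx) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Cross-entropy for one sample with a one-hot label at index `label`."""
    return -math.log(softmax(logits)[label])
```

The loss approaches 0 as the predicted probability of the true identity approaches 1 and grows as that probability shrinks.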
The triplet loss was applied to the metric learning phase. An input triplet consisted of three images: Anchor, Positive, and Negative. Through optimization, the distance between Anchor and Positive became smaller than that between Anchor and Negative, so the intra-class gap became smaller and the inter-class gap larger, which clustered the features of the same pedestrian. The original triplet loss selected samples randomly from the training set, so it tended to select relatively easy triplets that contribute little to optimizing the network. To solve this problem, a triplet loss based on hard sample mining was proposed in [23]. In each training batch, it randomly selected P pedestrians and K images of each of these pedestrians. For each Anchor in the batch, it selected the Positive with the farthest distance and the Negative with the nearest distance to form a triplet. Its calculation formula is shown in Equation (9).
where
,
, and
are the characteristics of Anchor, Positive, and Negative, respectively,
is the Euclidean distance between calculated features, and
is the boundary parameter controlling the minimum interval between positive and negative sample pairs.
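The hard sample mining step can be sketched as follows: for every anchor in a batch, take the farthest same-identity sample and the nearest different-identity sample, then apply a hinge with the margin. This is a minimal illustration that averages over anchors, which is an assumption; the function name is hypothetical, and the default margin of 0.3 follows the experimental configuration.

```python
import math


def batch_hard_triplet_loss(features, labels, margin=0.3):
    """Batch-hard triplet loss with Euclidean distance, averaged over anchors."""
    losses = []
    for i, (anchor, anchor_label) in enumerate(zip(features, labels)):
        positives = [math.dist(anchor, f)
                     for j, (f, lab) in enumerate(zip(features, labels))
                     if lab == anchor_label and j != i]
        negatives = [math.dist(anchor, f)
                     for f, lab in zip(features, labels)
                     if lab != anchor_label]
        if not positives or not negatives:
            continue  # this anchor cannot form a valid triplet
        # Hardest positive is the farthest; hardest negative is the nearest.
        losses.append(max(0.0, margin + max(positives) - min(negatives)))
    return sum(losses) / len(losses) if losses else 0.0
```

The loss is zero once every anchor is closer to all of its positives than to any negative by at least the margin.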
The triplet loss was used to train the global features of all outputs but not the local features. Local features may suffer from problems such as feature misalignment; if a background region were taken as a sample, the model could learn wrong information, which would affect the accuracy.
4. Experiments
4.1. Datasets
Many common pedestrian datasets are small in scale: they were captured by few cameras and contain insufficient images and pedestrian identities. At the retrieval stage, they provide a single query image for each identity, and their images are of limited practical value. To avoid these problems, we selected the Market-1501 and DukeMTMC-ReID datasets for the experiments. These two datasets contain sufficient images and pedestrian identities, were captured by multiple cameras, and have adequate query images collected from actual scenes.
Market-1501. The images in the Market-1501 dataset [24] were taken on the campus of Tsinghua University and include 32,668 images of 1501 identities. The training set includes 12,936 images of 751 people, and the test set includes 19,732 images of 750 people. The images were captured by six cameras. The average number of images per identity is 17.2 in the training set and 26.3 in the test set.
DukeMTMC-ReID. The DukeMTMC-ReID dataset [25] is the person re-identification subset of the Duke dataset and includes 36,411 images collected by 8 high-resolution cameras. Among them, the training set contains 16,522 images of 702 randomly sampled pedestrians, and the test set contains 17,661 images of another 702 pedestrians. The average number of images per identity is 23.5 in the training set and 25.2 in the test set.
4.2. Evaluation Metrics
In terms of performance evaluation, MSMG-Net is evaluated using the Cumulative Matching Characteristic (CMC) and the mean Average Precision (mAP). Rank-k accuracy in CMC is the proportion of queries for which a correctly matched image appears among the first k results ranked by similarity score. This paper uses Rank-1 to evaluate the proposed method.
The mAP metric is based on the area under the precision-recall curve of each query, called the average precision (AP); its calculation formula is shown in Equation (10), where AP is the area under the precision-recall curve, k is the total number of categories, and mAP is the average of the AP values over all categories.
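Both metrics can be computed from ranked match lists, where each query has a list of booleans marking whether the gallery image at each rank shares its identity. The sketch below uses the common discrete form of average precision (the mean of the precision values at the ranks of the correct matches) as an assumption; the function names and input format are chosen for this example.

```python
def average_precision(ranked_matches):
    """AP for one query; ranked_matches is ordered by decreasing similarity."""
    hits = 0
    precision_sum = 0.0
    for rank, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precision_sum += hits / rank  # precision at this correct match
    return precision_sum / hits if hits else 0.0


def mean_average_precision(all_ranked_matches):
    """mAP: the mean of the per-query AP values."""
    aps = [average_precision(r) for r in all_ranked_matches]
    return sum(aps) / len(aps)


def rank_k_accuracy(all_ranked_matches, k=1):
    """Fraction of queries with at least one correct match in the top k."""
    hits = sum(1 for r in all_ranked_matches if any(r[:k]))
    return hits / len(all_ranked_matches)
```

Rank-1 only checks the first result of each query, whereas mAP rewards placing every correct match near the top of the ranking.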
4.3. Experimental Configuration
We implemented the algorithm in the PyTorch framework and used an NVIDIA Quadro RTX 8000 GPU for acceleration. We used the Adam optimization algorithm to train the model and set the momentum to 0.9. Experiments show that initializing the network with pre-trained weight parameters accelerates convergence and improves network performance. Therefore, the model was fine-tuned from a pre-trained ResNet50. In the training process, we resized the input images to a fixed resolution and expanded the datasets by random erasing and horizontal flipping. The size of each training batch was 32, with a total of 160 iterations. The network dynamically adjusted the learning rate: starting from the initial learning rate, it was adjusted when the number of iterations reached 120 and again when it reached 140. The margin hyperparameter of the triplet loss function was set to 0.3.
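The learning rate adjustment can be sketched as a step schedule with milestones at iterations 120 and 140. The base learning rate and the decay factor `gamma` below are hypothetical placeholders for illustration only, and the assumption that each adjustment lowers the rate by the same factor is ours; the function name is also hypothetical.

```python
def step_learning_rate(iteration, base_lr, milestones=(120, 140), gamma=0.1):
    """Return the learning rate at a given iteration under a step schedule.

    base_lr and gamma are illustrative placeholders, not values from the paper.
    """
    # Count how many milestones have been reached so far.
    drops = sum(1 for m in milestones if iteration >= m)
    return base_lr * gamma ** drops
```

In PyTorch, a schedule of this shape is typically expressed with `torch.optim.lr_scheduler.MultiStepLR`.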
4.4. Comparison with State-of-the-Art Methods
The experimental results of the algorithm model on the two datasets are shown in Table 1. We compared the experimental results of MSMG-Net with those of other mainstream person re-identification algorithms on the same datasets. Then we analyzed the shortcomings of the existing models and evaluated the algorithm in this paper. The optimal results are shown in bold font.
As the table shows, MSMG-Net achieved mAP and Rank-1 of 88.6% and 96.3% on the Market-1501 dataset and 80.1% and 89.9% on the DukeMTMC-ReID dataset, respectively. Compared with the PCB+RPP [6] and HPM [7] algorithms, which apply local features, MSMG-Net achieved a significant performance improvement on both datasets for the mAP and Rank-1 indicators. The results suggest that excessively fine block granularity reduces the discriminability of local features, and that considering only local features without global information leads to a lack of connections between features. Combining global features with local features is a better feature extraction method. Meanwhile, compared with the MGN [8] network, which uses global and local features, MSMG-Net performed better, especially in the mAP index, indicating that the proposed algorithm was more stable over the whole retrieval process. Compared with MSFNet [9], which also uses multi-scale features, MSMG-Net performed better. Meanwhile, compared with the Vit-Transformer [11] and Swin-Transformer [12] algorithms, which apply the Transformer module, MSMG-Net combined CNN with the Transformer module based on improved multi-head self-attention mechanisms and combined local features and multi-scale features. It achieved better performance than simply using the Transformer module.
Compared with the ISP [29] algorithm, one of the better-performing mainstream algorithms, the mAP of our algorithm on the Market-1501 dataset showed no improvement and the Rank-1 improved by 1.0%, while the mAP and Rank-1 on the DukeMTMC-ReID dataset improved by 0.1% and 0.3%, respectively. Regarding the mAP index, the algorithm in this paper did not achieve a noticeable performance improvement. However, the ISP [29] algorithm introduced a semantic analysis method for the segmentation of human parts and objects, and its calculation process was relatively complex.
MSMG-Net considered more types of features. The combination of multi-scale and multi-granularity branches could achieve sufficient feature extraction and obtain more complete and comprehensive pedestrian information. At the same time, the various structures proposed in the network can help improve accuracy. It can be seen that the bidirectional cross-pyramid structure can effectively combine features of different scales in the process of feature fusion. The combination of CNN and Transformer can effectively capture long-distance dependencies between image pixels and obtain complete global information. Local information can be effectively obtained by splitting the feature map horizontally. Based on the above experimental results, compared with the current mainstream algorithms, MSMG-Net showed better results on the two datasets, which can verify its effectiveness and advancement.
4.5. Ablation Study
4.5.1. Validity Verification of Each Branch
MSMG-Net comprised the multi-scale branch and the multi-granularity branch, among which the multi-granularity branch also included the global and local feature branches. To prove the effectiveness of each part proposed in this paper, ablation experiments were carried out on the two datasets, as shown in Table 2. The baseline used ResNet-50 as the backbone and took cross-entropy loss and triplet loss as the loss function. The optimal results are shown in bold font. √ indicates that the network applied this module during the experiment process.
It can be seen from the table that the performance gradually improved by adding multi-scale features, global features, and local features based on the baseline. Compared with the baseline, combining the two branches increased 15.4%/6.7% and 17.6%/10.1% in mAP/Rank-1 on the Market-1501 and DukeMTMC-ReID datasets, respectively. According to the analysis of the experimental results, the multi-scale branch solved the problem of insufficient feature expression caused by the single-layer semantic features and reduced the loss of feature information; the multi-granularity branch simultaneously obtained global and local features of pedestrians. Global features enhanced the overall information perception of pedestrian images, and local features focused on image details, increasing feature diversity. Finally, the comprehensive network combined multi-scale and multi-granularity features to complement each other and obtained sufficient pedestrian identity information.
4.5.2. Validity Verification of Each Part in the Multi-Scale Branch
We used the bidirectional cross-feature pyramid network for feature fusion in the multi-scale branch. The output features of Res3 and Res4 were fused to obtain the shallow position supervision feature, and the output features of Res4 and Res5 were fused to obtain the deep semantic supervision feature. As shown in Table 3, the effectiveness of these two features was verified with the other branches unchanged. √ indicates that the network applied this module during the experiment process. It can be seen from the table that both the shallow position supervision feature and the deep semantic supervision feature improved the accuracy of person re-identification, and that the combination of the two features obtained the best performance.
The multi-scale branch added a dual attention mechanism on top of feature fusion. As shown in Table 4, the effectiveness of position and channel attention was verified with the other branches unchanged. √ indicates that the network applied this module during the experiment process. It can be seen from the table that adding an attention mechanism, which gives different weights to different features before fusing them, further improved the accuracy. Meanwhile, the dual attention mechanism achieved better performance than applying position attention or channel attention alone.
4.5.3. Validity Verification of Each Part in the Multi-Granularity Branch
The multi-granularity branch acquired global features by combining CNN and Transformer, adding relative position coding in the width and height directions to improve the multi-head self-attention mechanism. To verify the effectiveness of the proposed method, single-dimensional and two-dimensional relative position coding were compared with the other branches unchanged. As shown in Table 5, applying two-dimensional relative position coding achieved higher accuracy.
When obtaining local features, the multi-granularity branch horizontally split the feature map into two and three parts. To verify the effectiveness of the proposed method, we split the feature map into two, three, and four parts horizontally. As shown in Table 6, we combined the different segmentation scales and kept the other branches unchanged. √ indicates that the network applied this module during the experiment process. It can be seen from the table that combining the splits into two, three, and four parts obtained the best performance. However, the improvement over combining only the two-part and three-part splits was insignificant. Therefore, we combined the feature maps split into two and three parts to reduce the amount of computation while maintaining performance.
4.6. Visualization Result
The Rank-10 retrieval results of the same query image using the baseline and the algorithm in this paper on the two datasets are shown in Figure 5 and Figure 6. The green number in the figure represents a correct query result, and the red number represents a wrong query result. As shown in the figures, the retrieval results of this paper’s algorithm achieved higher accuracy than the baseline.
5. Conclusions
This paper proposed a new dual-branch person re-identification algorithm, called MSMG-Net. It combined multi-scale and multi-granularity branches. The multi-scale branch obtained multi-scale features of different layers, and the multi-granularity branch obtained multiple fine-grained features. The network can extract more comprehensive and sufficient pedestrian information and improve the accuracy of person re-identification. In addition, a dual attention module was added to the multi-scale branch to make the multi-scale feature pay more attention to the critical information in the pedestrian image. In the multi-granularity branch, the Transformer module with improved multi-head self-attention mechanisms was applied to capture the long-distance dependence between image pixels and obtain a more accurate relative position relationship. The experimental results showed that compared with most existing person re-identification algorithms, MSMG-Net improved performance on the Market-1501 and DukeMTMC-ReID datasets.
However, the current algorithm still has certain limitations and requires further research and experiments:
Each module proposed in this article has improved the accuracy of the pedestrian recognition algorithm to a certain extent. Still, it increases the complexity of the model and is unsuitable for large-scale deployment and application. In the subsequent improvement process, lightweight networks can be introduced to ensure accuracy while improving the real-time performance of the algorithm.
The algorithm in this article has been tested on public datasets with relatively limited data volume and has not been tested in actual monitoring scenarios. Testing data in real environments is also necessary if applied to practical systems. Therefore, improving the model’s generalization ability is also an important research objective.