1. Introduction
Person re-identification (ReID) is a critical component of intelligent video analytics, particularly in situations where facial recognition fails or camera quality is poor. With the rapid advancement of AI technology, scholars have become increasingly interested in integrating AI into security-related applications. Given the limitations of standard surveillance cameras and the sub-optimal performance of AI technologies such as facial recognition, researchers have studied the use of ReID within intelligent public-security monitoring systems extensively. While CNN-based methods have long dominated ReID research and have recently made significant advances [1,2,3,4,5], the representation of global contextual information, which is critical in sophisticated computer vision tasks, remains a challenge. Although CNNs are powerful at extracting local features, they often struggle to represent contextual information sufficiently.
The Transformer architecture has attracted considerable attention in recent years [6]. This interest can be attributed, at least in part, to the success of self-attention-based Transformers in natural language processing (NLP), which inspired scholars to explore their application to various computer vision tasks, including image classification, object detection, and semantic segmentation [7,8,9,10]. Self-attention-based Transformers have demonstrated an exceptional capability to capture long-distance dependencies, presenting an attractive alternative to CNNs. The vision Transformer (ViT) [7] and data-efficient image Transformers (DeiT) [9] are two models that replace the conventional CNN backbone with a pure Transformer. In ViT, input images are partitioned into non-overlapping patches, each assigned a unique token. These patches are then processed by self-attention-based Transformer blocks that capture global relations and extract features for classification. Although Transformer-based models such as ViT and DeiT have inspired considerable interest, their utility for high-precision images is limited: their ability to extract local features is relatively weak, and they require significant computational power, hindering adoption for computer vision tasks. As a result, researchers have been actively investigating methods that integrate the Transformer architecture with CNNs to leverage the strengths of both [11,12]. Several studies embed the Transformer directly in a CNN backbone, which not only allows a more comprehensive assimilation of features and information but also consumes less computation than a pure Transformer model. Examples include AA-ResNet [13] and BoTNet [14].
The ReID task is recognized for its intrinsic challenges, including subtle inter-class differences, significant intra-class variability, and heightened complexity relative to other computer vision tasks. In public spaces, individuals frequently wear similar clothing, carry similar bags, etc. (as depicted in Figure 1), necessitating comprehensive information that encompasses both long-distance feature dependencies and local features, especially fine-grained features. However, embedding long-distance dependencies by merely applying self-attention inevitably results in the loss of fine-grained features, which are widely acknowledged to be crucial to the model's performance. Consequently, developing specific modules that can balance the extraction of diverse features is indispensable for adapting the hybrid CNN-Transformer structure to the ReID task.
In this paper, we introduce a new ReID framework called DWNet, which learns robust feature representations for ReID tasks. DWNet employs a parallel architecture that combines CNN-based local features and Transformer-based global features. Considering the differences between CNN and Transformer features, we add a convolutional activation module to the Transformer branch, containing a 1 × 1 convolution, BatchNorm [15], and LayerNorm [16], to balance these differences and facilitate feature fusion. We use a specially designed feature fusion gate (FFG) with dynamic weights to fuse CNN-based and Transformer-based features and thereby reduce fine-grained feature loss.
Due to the specificity of the ReID task, embedding long-distance dependencies multiple times is not appropriate. Hence, the ideal structure of the DWNet framework may vary depending on the CNN backbone used. We employed two representative backbones: ResNet [17] and OSNet [18]. In ResNet, we replaced the original CNN layer in the fourth layer of the network with a CNN-Transformer parallel structure. In the lightweight network OSNet, we instead enhanced each residual block in the first layer of the network. Compared to the original models, ours achieved 2.5% and 2.2% mean average precision (mAP) improvements on the Market1501 dataset while requiring minimal additional parameters and computation.
Our contributions are summarized as follows:
To enable the ReID model to retain the powerful ability to extract local features of CNNs while also acquiring long-distance dependencies without exceeding resource consumption limits, we conducted extensive experiments to investigate the feasibility and challenges of using a neural network model with a parallel structure of both CNNs and Transformers in the ReID task;
Based on the problems identified in those experimental results, we propose the FFG to iteratively fuse CNN-based local features with Transformer-based global representations, and we experimentally verify its general applicability;
We propose a high-performance ReID framework called DWNet, built on the FFG. DWNet can fuse local features and global representations according to specific conditions. It outperforms the original baselines in the ReID task with comparable parameter complexity and computational consumption, demonstrating its potential to serve as the backbone of a ReID model.
3. Methods
Thanks to their powerful local feature extraction capability, CNN-based models achieve high accuracy at low cost, boosting the rapid development of computer vision. However, CNNs focus on aggregating local features, which hinders their capacity to acquire global representations, a limitation inherent to their structure. Although several techniques have been developed to overcome this challenge, they are restricted by their own structural problems and thus fall short of providing significant improvement. On the other hand, Transformer-based models have an innate ability to capture global representations, thanks to the self-attention mechanism that captures long-distance relationships within sequences. Integrating CNN and Transformer network structures to enhance model performance in ReID tasks thus presents a challenging problem.
Drawing on [11,12,14], we attempt to implement a hybrid CNN-Transformer architecture that improves accuracy in the ReID task without considerably increasing computational demands. Directly integrating a Transformer into a CNN leads to fine-grained feature loss. To resolve this issue, we propose a parallel network structure called DWNet. Given that Transformer-based neural networks require extensive computation, we employ a CNN as the foundation of the DWNet framework.
DWNet’s primary concept is to utilize a parallel-merge structure comprising CNN and Transformer branches for the fusion of local features and global representations. An essential aspect is a custom mechanism that dynamically adjusts the channel weights of the branches to minimize multiscale feature loss during branch merging. There are two main structures of DWNet. The first employs MHSA and the FFG directly in the residual blocks of single-branch and multi-branch CNNs, as exhibited in Figure 2; the network structure is adjustable by tuning the number of these residual blocks. The second replaces a specific layer of the original CNN with a parallel CNN and Transformer, using the FFG in the connecting layer, as illustrated in Figure 3.
Based on our experiments, we have concluded that incorporating the self-attention mechanism multiple times to embed global long-distance feature dependency is often less effective than using it only once. This is especially true when it is overused. We have determined that while FFG within a residual block or stage can achieve a local optimum through adjustment of the weight parameter, using multiple residual blocks or stages to achieve a local optimum does not guarantee a global optimum. Our experimental results have enabled us to create the most effective DWNet structure for different CNN backbones, including two representative backbones—ResNet and OSNet—for the ReID task.
3.1. Feature Fusion Gate
There is misalignment [12] between the feature maps of the CNN branch and the output of the Transformer branch. Moreover, a simple connection is not well suited to the ReID task and inevitably causes loss of fine-grained feature information from the CNN branch. To solve this, we propose the FFG, which adjusts the weights of the feature maps of the CNN and Transformer branches according to stimulus content and then couples CNN-based local features with Transformer-based global representations by summing the feature maps of the two branches according to these weights. The whole process is illustrated in Figure 4.
Double branch: For the given feature map of the CNN branch $X_c \in \mathbb{R}^{H \times W \times C}$ and the given feature map of the multi-head self-attention branch $X_t \in \mathbb{R}^{H \times W \times C}$, we conduct two transformations $\mathcal{F}_c$ and $\mathcal{F}_t$ with the CNN branch and the MHSA branch, respectively. Note that $\mathcal{F}_c$ and $\mathcal{F}_t$ have different compositions: $\mathcal{F}_c$ consists of efficient convolution, BatchNorm [15], and ReLU [26] in sequence, while $\mathcal{F}_t$ consists of tuned convolution, MHSA, and an activation layer in sequence.
Multi-stream: Some CNN residual blocks contain multiple streams; to bring in the information of each stream, we use a new dimensional index $k$ that denotes the number of CNN residual block streams. $U$ is the sum of the representations of the $k$ CNN streams $X_1, \dots, X_k$ and the Transformer stream $X_t$:

$$U = \sum_{i=1}^{k} X_i + X_t$$

When $k = 1$, the CNN residual block consists of the convolution of a single stream; when $k > 1$, the CNN residual block consists of multiple streams, each consisting of convolutions of the same or different kernel sizes.
Calculate the weights: First, we integrate information from each branch through summation. Then, we obtain global information by using global average pooling to generate channel-wise statistics. Specifically, we shrink $U$ through the spatial dimensions $H \times W$, reshaping from $(h, w, c)$ to $(s, c)$. We use a channel-wise parameter $s \in \mathbb{R}^{C}$ to represent it:

$$s_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} U_c(i, j)$$

Further, we set $z$ to represent the result of the transformation. This is achieved by fully connected layers, and we use two convolution operations to reduce the dimensions for efficiency. The transformation is formulated as follows:

$$z = W_2 \, \mathrm{rel}\big(\mathcal{B}(W_1 s)\big)$$

Here, $W_1$ and $W_2$ are two convolution transformations, rel is the ReLU function [26], and $\mathcal{B}$ denotes the Batch Normalization [15] that can be learned to capture the importance of each channel. We use $r$ to denote the dimensionality reduction multiplier, so the actual number of channels for the Batch Normalization and ReLU function is $C/r$. Then, we flatten $z$ to $(k + 1)$ dimensions for the next soft attention operation; the dimension of $z$ is $(k + 1, s, c)$.

For ease of expression, we will not distinguish between CNN streams and Transformer streams. We use $z_i$ for each stream, where the first $k$ streams are CNN streams and the last stream is a Transformer stream. A softmax mapping determines the weight $a_{i,c}$ of each stream for the $c$-th channel based on $z$:

$$a_{i,c} = \frac{e^{z_{i,c}}}{\sum_{j=1}^{k+1} e^{z_{j,c}}}$$
Fuse: The final feature map $V$ is obtained by weighting each stream by its soft attention weights and summing. To facilitate the distinction between the streams of CNN and Transformer, we show the first $k$ streams (CNN) and the last stream (Transformer) separately in Equation (6):

$$V_c = \sum_{i=1}^{k} a_{i,c} \cdot X_{i,c} + a_{k+1,c} \cdot X_{t,c} \qquad (6)$$

The output of the final feature fusion is $V \in \mathbb{R}^{H \times W \times C}$.
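To make the gating concrete, the following is a minimal NumPy sketch of the FFG computation for k CNN streams and one Transformer stream. The random matrices stand in for the two learned fully connected (1 × 1 convolution) transformations, BatchNorm is omitted, and all names and shapes are illustrative rather than the paper's actual implementation.

```python
import numpy as np

def feature_fusion_gate(cnn_streams, trans_map, r=4, rng=None):
    """Hypothetical FFG sketch: fuse k CNN streams and one Transformer
    stream with channel-wise soft attention (channels-last layout)."""
    rng = np.random.default_rng(0) if rng is None else rng
    streams = list(cnn_streams) + [trans_map]   # k + 1 streams, each (H, W, C)
    U = np.sum(streams, axis=0)                 # integrate by summation
    C = U.shape[-1]
    s = U.mean(axis=(0, 1))                     # global average pooling -> (C,)
    # two learned transforms (random stand-ins), reduction ratio r
    W1 = rng.standard_normal((C // r, C)) * 0.1
    W2 = rng.standard_normal((len(streams) * C, C // r)) * 0.1
    z = (W2 @ np.maximum(W1 @ s, 0.0)).reshape(len(streams), C)
    # softmax over streams for each channel: weights sum to 1 per channel
    a = np.exp(z - z.max(axis=0, keepdims=True))
    a = a / a.sum(axis=0, keepdims=True)
    # weighted sum of streams -> fused feature map V
    V = sum(a[i] * streams[i] for i in range(len(streams)))
    return V, a
```

Per channel, the softmax weights across the k + 1 streams sum to one, so the gate trades local detail against global context rather than simply adding the branches.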
3.2. DWNet Uses ResNet as the CNN Backbone (DWNet-R)
The DWNet-R model, which employs ResNet as the backbone, is composed of four parts: the CNN backbone, the CNN branch, the Transformer branch, and the FFG that connects the two branches. Layer four of DWNet-R is shown in Figure 3. The stem component of DWNet is similar to that of ResNet; both utilize the feature pyramid structure. The benefit of this structure is that the size of the feature map is reduced while the number of channels increases with each layer, thereby enhancing the feature extraction capability. Following the ResNet50 structure, the network can be divided into four layers. The first layer applies a 7 × 7 convolution and max pooling, while the second through fourth layers comprise varying numbers of bottlenecks, each containing two 1 × 1 convolutions, to reduce computation and regulate the number of channels, and a 3 × 3 convolution. Finally, the output of each bottleneck is added to its input as a residual connection.
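The computational benefit of the bottleneck design can be checked with simple arithmetic. The sketch below counts multiply-accumulates for a hypothetical 14 × 14 feature map with illustrative channel counts (not figures from the paper), comparing two full-width 3 × 3 convolutions against a 1 × 1 reduce / 3 × 3 / 1 × 1 expand bottleneck:

```python
def conv_flops(h, w, c_in, c_out, k):
    # multiply-accumulates for a k x k convolution over an h x w output map
    return h * w * c_in * c_out * k * k

# plain block: two 3x3 convolutions at full width (illustrative sizes)
plain = 2 * conv_flops(14, 14, 1024, 1024, 3)

# bottleneck: 1x1 reduce, 3x3 at reduced width, 1x1 expand
bottleneck = (conv_flops(14, 14, 1024, 256, 1)
              + conv_flops(14, 14, 256, 256, 3)
              + conv_flops(14, 14, 256, 1024, 1))

print(plain // bottleneck)  # -> 16
```

With these illustrative sizes, the bottleneck needs roughly 17× fewer multiply-accumulates than the plain two-convolution block, which is why the 1 × 1 convolutions are used to reduce computation and regulate the number of channels.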
CNN Branch:
The CNN branch is consistent with the fifth layer of ResNet50 and consists of several bottlenecks (three in ResNet50).
Transformer Branch:
We use the multi-head attention mechanism directly in the Transformer block of our DWNet model rather than in a separate component as in ViT [7]. This block comprises a multi-head self-attention module, a down-projection fc layer, and an up-projection fc layer, with LayerNorms applied before the self-attention module and each fc layer. In addition, because the 3 × 3 convolution of the CNN branch can extract spatial location information and local features [27], which is similar to the position embedding technique employed in ViT, we omit ViT's position embedding on the Transformer branch for the sake of streamlining the model.
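The block described above can be sketched in NumPy as follows. For brevity this sketch uses a single attention head where the model uses multi-head attention, applies one LayerNorm before the fc pair rather than before each fc layer, and uses hypothetical parameter names throughout; it is an illustration under those assumptions, not the paper's implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each token over the channel dimension
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def self_attention(x, Wq, Wk, Wv):
    # single-head scaled dot-product attention (the model uses multi-head)
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    return (w / w.sum(-1, keepdims=True)) @ v

def transformer_branch_block(x, p):
    # pre-norm attention sub-layer with residual connection
    h = x + self_attention(layer_norm(x), p["Wq"], p["Wk"], p["Wv"])
    # pre-norm feed-forward: down-projection fc, then up-projection fc
    z = layer_norm(h)
    return h + np.maximum(z @ p["down"], 0.0) @ p["up"]
```

Note that no position embedding appears anywhere in the block, reflecting the design choice described above.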
3.3. DWNet Uses OSNet as the CNN Backbone (DWNet-O)
OSNet is an omni-scale feature learning network explicitly designed for the ReID task. Similar to ResNet, OSNet comprises multiple residual blocks, whose exceptional attribute is their ability to capture features at various scales using multiple convolutional streams. To dynamically fuse the multi-scale features, OSNet introduces an aggregation gate.
OSNet leverages convolution operations with different kernel sizes to obtain features at various scales, for example stacking multiple 3 × 3 convolutions to emulate a 5 × 5 convolution. This powerful multi-scale feature extraction capability allows OSNet to achieve better performance on ReID tasks at a lower cost.
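The receptive-field argument behind stacking 3 × 3 convolutions can be verified with a short calculation; the channel count used here is illustrative:

```python
def receptive_field(num_3x3):
    # receptive field of num_3x3 stacked stride-1 3x3 convolutions
    rf = 1
    for _ in range(num_3x3):
        rf += 2  # each 3x3 conv widens the field by 2 pixels
    return rf

def stack_params(num_3x3, c):
    # weights in num_3x3 stacked 3x3 convolutions, c channels in and out
    return num_3x3 * 3 * 3 * c * c

def single_params(k, c):
    # weights in a single k x k convolution, c channels in and out
    return k * k * c * c
```

Two stacked 3 × 3 convolutions cover the same 5 × 5 region as one 5 × 5 convolution (receptive_field(2) returns 5) while using fewer weights (2 · 9 · c² versus 25 · c²), which is the cost advantage the text refers to.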
We directly apply multi-head self-attention and the FFG in the residual blocks, removing the unified aggregation gate. The original multi-scale convolutions are connected to the FFG as multiple branches alongside the MHSA branch. To improve the model's performance, we replaced the residual blocks in the first of the middle three layers of the original OSNet (which comprises two residual blocks per layer) with the new residual blocks. The resulting model is referred to as DWNet-O, whose conv1 is depicted in Figure 2.
5. Discussion
5.1. Experimental Results and Analysis
ResNet incorporates a variable number of bottlenecks in each of its four convolutional layers. According to the experimental data in the second step, replacing bottlenecks in layer1, layer2, layer3, or layer4 of the original ResNet with a bottleneck incorporating the Transformer enhances model performance. However, embedding the Transformer multiple times, for instance replacing bottlenecks in layer1 and layer2 simultaneously, yields no better performance than embedding the Transformer once.
Figure 6 and Figure 7 show the feature map outputs when the FFG is used to embed the Transformer. When the FFG is used to fuse the feature maps, it can clearly be seen that fewer CNN-based feature maps are retained than without the FFG (with the FFG, more feature maps are completely inactive or contain only partially activated points). Although the FFG better preserves long-distance feature dependencies, it preserves fewer feature maps containing local feature information, especially fine-grained feature information. This reduces the ability of the next layer to extract local features, resulting in the loss of local feature information. Embedding the Transformer multiple times worsens this problem, so multiple embeddings may perform slightly worse than one or two Transformer embeddings, although still better than no embedding.
The experimental results show that replacing the bottlenecks in conv2, conv3, and conv4 of the original OSNet with the Transformer-embedded bottleneck improves the performance of the model. We find that the effect becomes less pronounced the higher the layer at which the replacement is performed; on the MSMT17 dataset, replacing the bottleneck in conv4 yields only a minimal performance improvement. The diminished effect at higher layers stems from the smaller feature maps there: the feature maps output by conv2, conv3, and conv4 are 64 × 32, 32 × 16, and 16 × 8, respectively. This indicates that the effectiveness of extracting long-distance dependencies directly with multi-head self-attention depends on the size of the input feature map: the larger the input feature map, the more long-distance feature-dependency information can be extracted. In addition, the effect of the feature fusion gate (FFG) is related to the number of output feature maps at its location: the more output feature maps there are, the more feature maps can retain different local features, and the lower the loss of local features, especially fine-grained features, caused by embedding long-distance feature dependencies. These two points mean that the optimal location for the FFG is greatly influenced by the network structure of the backbone; for example, the optimal locations in ResNet and OSNet differ considerably, and sufficient experiments are needed to determine exactly where to use the FFG.
In the OSNet ablation experiments, we found that replacing the original layers with layers in which CNN and Transformer run in parallel performed poorly on the Market1501 and DukeMTMC-reID datasets but performed well on the MSMT17 dataset, outperforming all bottleneck-replacement approaches. This shows that the optimal DWNet structure for a given CNN backbone cannot be generalized and needs to be determined for the specific real-world situation.
As shown in Table 6, we compare DWNet-R and DWNet-O with their respective original baselines. Compared to the original baseline ResNet, DWNet-R slightly increases the number of parameters and FLOPs, which is still within an acceptable range, while DWNet-O shows almost no increase in parameters or FLOPs compared to the original baseline OSNet. DWNet is thus at a reasonable level in terms of the number of parameters, FLOPs, and memory consumption, and the increase is especially small when applied to lightweight models. This shows that DWNet is simple, efficient, and flexible; compared with other ReID models that use the Transformer, it has advantages in the number of parameters, FLOPs, and memory usage.
5.2. Ethical Considerations and Future Improvements for DWNet
First and foremost, we believe that moral and ethical considerations are paramount when dealing with aspects such as identification and data storage. Therefore, while developing the DWNet technology, we ensured that the datasets used were open-source, ethical, and free of legal issues. During our experiments, we complied with relevant ethical principles to ensure the security and privacy of the data and to prevent leakage of personal information. We also comply with relevant laws and regulations to ensure that our technology meets ethical and legal standards.
In practical security surveillance applications, models are often deployed on embedded devices. Given the performance constraints of these devices, backbone models such as ResNet and OSNet are typically used for pedestrian re-identification. DWNet, which offers comparable performance with reduced computational demands, can replace these backbone models without significantly increasing resource consumption while improving recognition accuracy. Furthermore, due to the high flexibility of DWNet, its structure can be adjusted to accommodate different environments. For instance, the DWNet structure that replaces the original layers of OSNet outperforms other structures on the MSMT17 dataset, which has a higher resolution than other datasets used in our experiments. As such, the DWNet structure can be employed in high-resolution camera scenes to enhance recognition rates.
In the future, we will continue to study the DWNet model structure to address the problem that its dynamic weight parameters can only reach local optima rather than a global optimum, which causes the decrease in accuracy after embedding the Transformer several times, as mentioned above. We hope to make the dynamic weight parameters of DWNet globally optimal by adding losses at different stages, so that embedding any number of Transformers will only improve the accuracy of the model without degrading it.
6. Conclusions
The pure Transformer visual backbone architecture is computation intensive, so combining a CNN with a self-attention visual backbone architecture has become a popular field of research. To address the issues inherent in applying such a combined architecture to the re-identification task, we propose a parallel CNN-plus-self-attention framework based on the feature fusion gate (FFG), called DWNet. Through ablation experiments, we demonstrated the general effectiveness of DWNet and determined that its different structures, DWNet-R and DWNet-O, improve performance compared to the original baselines while remaining computationally efficient. DWNet is simple, efficient, and portable, and is well suited to large-scale industrial application scenarios. It has the potential to serve as a backbone for re-identification tasks and can easily be combined with other methods to further improve model accuracy.