Article

MambaReID: Exploiting Vision Mamba for Multi-Modal Object Re-Identification

1 School of Computer and Information, Hohai University, Nanjing 211106, China
2 School of Mathematics and Statistics, Huaiyin Normal University, Huai’an 223300, China
3 School of Computer and Software, Nanjing Vocational University of Industry Technology, Nanjing 210023, China
* Author to whom correspondence should be addressed.
Sensors 2024, 24(14), 4639; https://doi.org/10.3390/s24144639
Submission received: 16 May 2024 / Revised: 26 June 2024 / Accepted: 15 July 2024 / Published: 17 July 2024
(This article belongs to the Section Sensing and Imaging)

Abstract

Multi-modal object re-identification (ReID) is a challenging task that seeks to identify objects across different image modalities by leveraging their complementary information. Traditional CNN-based methods are constrained by limited receptive fields, whereas Transformer-based approaches are hindered by high computational demands and a lack of convolutional biases. To overcome these limitations, we propose a novel fusion framework named MambaReID, integrating the strengths of both architectures with the effective VMamba. Specifically, our MambaReID consists of three components: Three-Stage VMamba (TSV), Dense Mamba (DM), and Consistent VMamba Fusion (CVF). TSV efficiently captures global context information and local details with low computational complexity. DM enhances feature discriminability by fully integrating inter-modality information with shallow and deep features through dense connections. Additionally, with well-aligned multi-modal images, CVF provides more granular modal aggregation, thereby improving feature robustness. The MambaReID framework, with its innovative components, not only achieves superior performance in multi-modal object ReID tasks, but also does so with fewer parameters and lower computational costs. Our proposed MambaReID’s effectiveness is validated by extensive experiments conducted on three multi-modal object ReID benchmarks.

1. Introduction

Object Re-identification (ReID) aims to re-identify the same object across different camera views. Due to its wide range of applications, object ReID [1,2,3] has advanced significantly in recent years. In particular, traditional object ReID mainly focuses on extracting discriminative information from easily accessible RGB images. However, various complex imaging conditions like darkness, strong lighting, and low image resolution may severely affect the quality of RGB images [4]. The critical regions of objects become blurred [5], resulting in a loss of discriminative information. Fortunately, multi-modal object ReID [6,7,8] has shown significant potential in overcoming these challenges. By integrating complementary information from near-infrared (NIR), thermal infrared (TIR), and RGB images, multi-modal object ReID can provide more robust representations in complex scenes. Thus, it has attracted increasing attention in the past few years.
In order to aggregate heterogeneous information from different modalities, Li et al. [5] first propose a multi-modal vehicle ReID benchmark named RGBNT100, which contains RGB, NIR, and TIR images. Meanwhile, they propose a HAMNet to learn robust representations with a heterogeneous coherence loss. Further, Zheng et al. [4] propose a PFNet to progressively fuse multi-modal features, along with the first multi-modal person ReID benchmark named RGBNT201. Wang et al. [9] employ three learning methods to enhance modality-specific knowledge with the IEEE framework. Then, Zheng et al. [6] introduce a DENet to tackle the modality-missing issue. With generative models, Guo et al. [10] present a GAFNet that fuses heterogeneous information. Meanwhile, with the generalization ability of Transformers [11], researchers begin to explore the potential of Transformers in multi-modal object ReID. Pan et al. [12] construct a PHT, utilizing a feature hybrid mechanism to balance information from different modalities. Through analyzing modality laziness, Crawford et al. [13] provide a strong baseline, UniCat. Further, Wang et al. [7] introduce a new token permutation mechanism designed to enhance the robustness of multi-modal object ReID. From the perspective of test-time training, Wang et al. [14] propose a HTT to explore the information existing in the test data. Recently, Zhang et al. [8] construct an EDITOR to select diverse features and minimize the influence of background noise. Although these methods achieve promising results, they still have some limitations.
For CNN-based methods, their performance is hindered by limited receptive fields, making it difficult for them to capture global information from heterogeneous modalities. As for Transformer-based methods, while they exhibit superior performance, the quadratic computational complexity [11] introduced by attention mechanisms is unacceptable. Additionally, the lack of convolutional inductive bias [15] in Transformer-based methods results in weaker perception of local details, leading to the neglect of certain information among different modalities. Thus, efficient integration of global and local information becomes crucial for multi-modal object ReID. However, existing methods may fail to exploit the complementary advantages of the above frameworks. On the one hand, the features extracted by existing methods lack sufficient robustness. On the other hand, the most effective current approaches, which employ highly complex Transformer models, are not efficient. Therefore, there is an urgent need for an efficient method capable of extracting robust features.
Recently, Mamba [16] has drawn significant attention [17] due to its superior scalability with State Space Models (SSMs) [18]. Empowered by the Selection Mechanism (S6), Mamba surpasses alternative state-of-the-art architectures, like CNNs or Transformers, in managing long sequences with linear complexity. Furthermore, Mamba has been successfully applied to various computer vision tasks [19,20,21], such as image classification, object detection, and video understanding. To be specific, VMamba [19] integrates S6 with a four-direction scanning mechanism, which can fully capture the global context information and local details. Drawing inspiration from VMamba’s exceptional performance in image classification tasks, we introduce a novel fusion framework named MambaReID, specifically designed for multi-modal object ReID. This approach leverages the strengths of various modalities to enhance the re-identification process, aiming to significantly improve accuracy and efficiency in multi-modal object ReID scenarios.
Technically, our proposed MambaReID consists of three main components: Three-Stage VMamba (TSV), Dense Mamba (DM), and Consistent VMamba Fusion (CVF). More specifically, TSV is designed to extract robust multi-modal representations with a four-direction scanning mechanism. Instead of directly transferring VMamba, we observe that its final-stage downsampling leads to substantial detail loss and computational redundancy. For tasks like multi-modal ReID, high-resolution feature maps provide more details for subsequent modality fusion. Hence, we opt to skip the final stage and adopt the last-stride technique described in BoT [22]. This adaptation allows TSV to preserve richer details while reducing computational overhead, resulting in more robust outcomes compared to Transformers. Then, we introduce DM into TSV to further enhance its discriminative ability. Dense connections are crucial for fine-grained image classification tasks [23], as they offer semantic information at different levels, thereby enhancing the robustness of features. With dense connections, DM can fully integrate the inter-modality information with shallow and deep features. Different from previous dense connections, we only introduce them in the last stage of TSV with a small computational overhead. Hence, MambaReID can retain more fine-grained details with less computational cost. Finally, to effectively integrate information from multiple modalities, we introduce the CVF. CVF incorporates a consistency loss function to align deep features across different modalities. This loss ensures that features from different modalities are well aligned, facilitating effective fusion. The aligned features are then concatenated along the channel dimension and processed by a VMamba block for modality integration. This step enables simultaneous integration of information from multiple modalities at identical spatial positions, thereby enhancing the granularity of modal aggregation. In this way, CVF ensures the effective utilization of well-aligned multi-modal images. Overall, MambaReID provides more robust multi-modal features for multi-modal ReID. Extensive experiments on three public benchmarks demonstrate that our MambaReID achieves superior performance compared to most state-of-the-art methods.
In summary, our contributions are as follows:
  • We introduce a novel fusion framework named MambaReID. This work marks the first attempt at integrating Mamba into multi-modal object ReID.
  • We propose the Three-Stage VMamba (TSV) to extract robust multi-modal representations. TSV efficiently captures both global context information and local details, while maintaining low computational complexity.
  • We introduce Dense Mamba (DM) to seamlessly integrate inter-modality information across shallow and deep features. Moreover, we present Consistent VMamba Fusion (CVF) to fuse deep features originating from diverse modalities. Leveraging well-aligned multi-modal images, CVF refines modal aggregation granularity, consequently enhancing feature discriminability.

2. Proposed Methodology

2.1. Overall Architecture of MambaReID

As shown in Figure 1, our MambaReID consists of three main components: Three-Stage VMamba (TSV), Dense Mamba (DM), and Consistent VMamba Fusion (CVF). We employ the TSV as the backbone. It is designed to extract robust single-modal representations from RGB, NIR, and TIR modalities. DM is introduced at the final stage of TSV to further improve network discriminative capability. Additionally, with the well-aligned multi-modal images, CVF is employed to enhance the modal aggregation granularity. Detailed descriptions of our proposed modules are provided in the following sections.

2.2. Preliminary

State Space Models (SSMs). SSMs [24,25] have been widely employed across diverse sequential data modeling tasks. Inspired by continuous systems, SSMs adeptly capture the dynamic patterns inherent in input sequences. To be specific, an SSM maps the input $x(t) \in \mathbb{R}^{L}$ to the output $y(t) \in \mathbb{R}^{L}$, which is expressed as:

$\dot{h}(t) = A h(t) + B x(t),$
$y(t) = C h(t) + D x(t),$

where $A \in \mathbb{C}^{N \times N}$, $B \in \mathbb{C}^{N}$, and $C \in \mathbb{C}^{N}$ are the model parameters, while $D \in \mathbb{C}$ is the residual term. Here, $N$ is the dimension of the hidden state $h(t)$. The above equations can easily model continuous inputs. However, in the context of a discrete input like images or text, the SSM needs to be discretized with the zero-order hold (ZOH) method. Specifically, with $\Delta \in \mathbb{R}^{D}$ denoting the predefined timescale parameter that maps the continuous parameters $A$ and $B$ into a discrete space, the discretization process is described as:

$\bar{A} = e^{\Delta A},$
$\bar{B} = (\Delta A)^{-1}\left(e^{\Delta A} - I\right) \cdot \Delta B,$
$\bar{C} = C,$

where $\bar{B}, \bar{C} \in \mathbb{R}^{D \times N}$ are the discretized matrices. After discretization, the SSM is calculated as follows:

$h_k = \bar{A} h_{k-1} + \bar{B} x_k,$
$y_k = \bar{C} h_k + \bar{D} x_k.$

Finally, with the discretized SSM, we can model discrete image data with linear complexity.
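To make the discrete recurrence concrete, the following is a minimal sketch of the ZOH discretization and the resulting scan, written in PyTorch. The tensor shapes, the single-channel input, and the example parameter values are illustrative assumptions rather than the authors' implementation.

```python
import torch

def discretize(A, B, delta):
    """ZOH discretization: A_bar = exp(dA), B_bar = (dA)^{-1}(exp(dA) - I) dB."""
    N = A.shape[0]
    A_bar = torch.matrix_exp(delta * A)
    B_bar = torch.linalg.solve(delta * A, A_bar - torch.eye(N)) @ (delta * B)
    return A_bar, B_bar

def ssm_scan(x, A_bar, B_bar, C, D):
    """Discrete SSM recurrence h_k = A_bar h_{k-1} + B_bar x_k, y_k = C h_k + D x_k.
    x: (L,) single-channel sequence; A_bar: (N, N); B_bar, C: (N,); D: scalar."""
    L, N = x.shape[0], B_bar.shape[0]
    h = torch.zeros(N, dtype=x.dtype)
    y = torch.empty(L, dtype=x.dtype)
    for k in range(L):
        h = A_bar @ h + B_bar * x[k]       # state update
        y[k] = (C * h).sum() + D * x[k]    # readout plus residual term
    return y

# Tiny usage example with random, roughly stable parameters (assumed values).
N, L = 4, 10
A = -torch.eye(N) + 0.1 * torch.randn(N, N)
B = torch.randn(N)
A_bar, B_bar = discretize(A, B, delta=0.1)
y = ssm_scan(torch.randn(L), A_bar, B_bar, C=torch.randn(N), D=torch.tensor(1.0))
print(y.shape)  # torch.Size([10])
```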
Figure 1. The overall architecture of our MambaReID. First, the images from RGB, NIR, and TIR modalities are fed into the backbone TSV. With the lightweight TSV, we can extract robust multi-modal representations from different modalities. Then, in the last stage of TSV, DM is utilized to further integrate the information from both shallow and deep features. Finally, with the well-aligned multi-modal images, CVF is employed to enhance the modal aggregation granularity at the same spatial positions. Thanks to the proposed modules, our MambaReID generates more discriminative multi-modal information with low computational complexity.
Selective Scan Mechanism. SSM can be utilized for efficient sequence modeling. However, the conventional SSM may fail to capture the complex patterns in various input sequences. Without a data-dependent structure, SSM lacks the ability to focus on or ignore specific information. To solve this issue, Gu et al. [16] present a Selective Scan mechanism for SSM (S6), where the matrices $B \in \mathbb{R}^{L \times N}$, $C \in \mathbb{R}^{L \times N}$, and $\Delta \in \mathbb{R}^{L \times D}$ are generated from the input data $x \in \mathbb{R}^{L \times D}$. This enables S6 to fully perceive the contextual information of the input, rendering it more flexible and efficient.
2D Selective Scan. As shown in the first image of Figure 2, S6 scans the input sequence in a single direction. However, in the context of visual tasks, the sequences we encounter often consist of non-causal image data. Thus, the unidirectional scanning employed in S6 is impractical: with single-direction scanning, the current image patch can only perceive information from the preceding patches, rather than local information from different directions. To address this issue, VMamba [19] introduces a 2D Selective Scan (SS2D) mechanism. To be specific, SS2D scans the input sequence in four directions: left-to-right, right-to-left, top-to-bottom, and bottom-to-top. As shown in Figure 2, the patch in a specific region can perceive information from adjacent patches in different directions, which enhances feature discriminability with contextual information. Thus, in our MambaReID, we employ SS2D to extract robust multi-modal representations. Specifically, we only need to unfold the images into different sequences. After scanning, we restore them according to their original relative positions and then sum the scanning results from the four directions at the same position, as sketched below.
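The sketch below illustrates the unfold–scan–restore–sum procedure just described. The function names are assumptions for illustration, and `scan_1d` stands in for any 1D selective scan (e.g., S6); this is not VMamba's released API.

```python
import torch

def ss2d(feat, scan_1d):
    """2D Selective Scan sketch: unfold a (B, C, H, W) map into four directional
    sequences, apply a 1D scan to each, restore the original layout, and sum
    the four results at every spatial position."""
    B, C, H, W = feat.shape
    row_major = feat.flatten(2)                      # left-to-right, top-to-bottom
    col_major = feat.transpose(2, 3).flatten(2)      # top-to-bottom, left-to-right
    seqs = [row_major, row_major.flip(-1),           # forward and reversed row order
            col_major, col_major.flip(-1)]           # forward and reversed column order

    out = torch.zeros_like(row_major)
    for i, s in enumerate(seqs):
        y = scan_1d(s)
        if i % 2 == 1:                               # undo the sequence reversal
            y = y.flip(-1)
        if i >= 2:                                   # undo the transpose for column scans
            y = y.reshape(B, C, W, H).transpose(2, 3).flatten(2)
        out = out + y                                # merge at identical positions
    return out.reshape(B, C, H, W)

# Shape check with an identity "scan" standing in for S6.
x = torch.randn(2, 8, 14, 14)
print(ss2d(x, lambda s: s).shape)  # torch.Size([2, 8, 14, 14])
```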

2.3. Three-Stage VMamba

To fully exploit the discriminative information with low computational complexity, we first introduce VMamba as our backbone. Generally, VMamba is composed of four stages with three downsampling operations. However, the downsampling in the last stage of VMamba leads to a significant loss of detailed information. Moreover, the final stage introduces redundant computational costs. Therefore, following BoT [22], we eliminate the last downsampling to preserve richer details. Further, we directly remove the final stage to achieve more efficient modeling. Thus, we propose the Three-Stage VMamba (TSV) as our backbone, as shown in Figure 1.
To be specific, we denote the input multi-modal images as $X = \{X_{\text{RGB}}, X_{\text{NIR}}, X_{\text{TIR}} \in \mathbb{R}^{H \times W \times C}\}$, where $H$, $W$, and $C$ denote the height, width, and channel of the images, respectively. For illustrative purposes and without loss of generality, we consider the RGB modality as a representative example. The RGB images $X_{\text{RGB}}$ are first fed into the Stem block with a convolutional layer to extract the initial features $F_{\text{RGB}} \in \mathbb{R}^{\frac{H}{4} \times \frac{W}{4} \times C_1}$. Then, $F_{\text{RGB}}$ are fed into the VSS block to integrate the global and local information. Technically, the VSS block consists of LayerNorm (LN) [26], Linear transformation, Depthwise Convolution (DWConv) [27], SS2D, and SiLU [28] activation functions. As shown in the right corner of Figure 1, the VSS block can be formulated as:

$\Psi_i(X) = \text{Linear}_i(\text{LN}(X)), \quad i \in \{1, 2\},$
$\Omega(X) = \text{LN}(\text{SS2D}(\text{DWConv}(\Psi_1(X)))),$
$\text{VSS}(X) = \text{Linear}(\text{SiLU}(\Omega(X)) \times \text{SiLU}(\Psi_2(X))).$
Then, with the residual connection, we feed the output of the current VSS block into the next VSS block as follows:

$X_l = X_{l-1} + \text{VSS}_l(X_{l-1}),$

where $X_l$ denotes the output of the $l$-th VSS block. Thus, after the first stage of TSV, we obtain the features $F^{1}_{\text{RGB}} \in \mathbb{R}^{\frac{H}{4} \times \frac{W}{4} \times C_1}$. Then, the features $F^{1}_{\text{RGB}}$ are fed into the next stage for further processing. Similarly, we can obtain the features $F^{2}_{\text{RGB}} \in \mathbb{R}^{\frac{H}{8} \times \frac{W}{8} \times C_2}$. Finally, with the last stage of TSV, we obtain the features $F^{3}_{\text{RGB}} \in \mathbb{R}^{\frac{H}{16} \times \frac{W}{16} \times C_3}$. Similar to the RGB modality, we also extract the features $F^{3}_{\text{NIR}}$ and $F^{3}_{\text{TIR}}$. Then, we apply pooling ($P$) to $F^{3}_{\text{RGB}}$, $F^{3}_{\text{NIR}}$, and $F^{3}_{\text{TIR}}$, respectively. Finally, we concatenate the pooled features to obtain the multi-modal features $f_{tsv}$ as follows:

$f_{tsv} = [P(F^{3}_{\text{RGB}}), P(F^{3}_{\text{NIR}}), P(F^{3}_{\text{TIR}})],$

where $[\cdot]$ is the concatenation operation. After applying loss supervision on $f_{tsv}$, we can extract robust multi-modal representations with TSV.
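The following is a simplified sketch of the per-block computation in the equations above. The SS2D operator is passed in as a callable, and the expansion ratio, channels-last layout, and module names are assumptions for illustration rather than the released implementation.

```python
import torch
import torch.nn as nn

class VSSBlock(nn.Module):
    """Sketch of one VSS block: two linear branches over LN(x); branch 1 goes through
    DWConv -> SS2D -> LN (Omega), branch 2 acts as a gate; their SiLU activations are
    multiplied and projected back to `dim` channels. The residual connection
    X_l = X_{l-1} + VSS_l(X_{l-1}) is applied by the caller."""

    def __init__(self, dim, ss2d, expand=2):
        super().__init__()
        hidden = dim * expand
        self.norm_in = nn.LayerNorm(dim)
        self.linear1 = nn.Linear(dim, hidden)                 # Psi_1
        self.linear2 = nn.Linear(dim, hidden)                 # Psi_2 (gate branch)
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.ss2d = ss2d                                      # any (B, C, H, W) -> (B, C, H, W) scan
        self.norm_mid = nn.LayerNorm(hidden)
        self.proj_out = nn.Linear(hidden, dim)
        self.act = nn.SiLU()

    def forward(self, x):                                     # x: (B, H, W, C), channels last
        z = self.norm_in(x)
        p1, p2 = self.linear1(z), self.linear2(z)             # Psi_1(x), Psi_2(x)
        u = self.dwconv(p1.permute(0, 3, 1, 2))               # to (B, C', H, W) for conv/scan
        omega = self.norm_mid(self.ss2d(u).permute(0, 2, 3, 1))
        return self.proj_out(self.act(omega) * self.act(p2))  # VSS(x), no residual here

# Usage: an identity callable stands in for SS2D just to check shapes.
blk = VSSBlock(dim=96, ss2d=lambda t: t)
x = torch.randn(2, 16, 8, 96)
print((x + blk(x)).shape)  # residual applied outside: torch.Size([2, 16, 8, 96])
```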

2.4. Dense Mamba

Dense connections [29] have been extensively utilized in various computer vision tasks. They can effectively enhance feature discriminability by fully integrating the information from shallow and deep features. However, directly introducing dense connections into the TSV would lead to a significant increase in complexity. Additionally, the effectiveness of dense connections in VMamba remains unclear. Therefore, we explore the potential of dense connections in VMamba and introduce the Dense Mamba (DM) into the last stage of TSV. As shown in the right corner of Figure 1, we simplify the dense connections by directly adding the previous features to the current features. To be specific, the DM can be formulated as:

$X_l = \bar{X}_{l-1} + \text{VSS}_l(\bar{X}_{l-1}),$
$\bar{X}_l = \frac{1}{l} \sum_{i=1}^{l} X_i,$

where $l$ denotes the $l$-th VSS block in the last stage of TSV. With these simple dense connections, we can fully integrate the information from different levels within a single modality, while the computational overhead remains extremely low.
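A minimal sketch of this running-average dense connection is given below, reusing the VSSBlock sketch from the previous subsection. Treating the stage input as the first collected feature is our assumption; the paper does not spell out the indexing of the first block.

```python
import torch
import torch.nn as nn

class DenseMambaStage(nn.Module):
    """Sketch of the Dense Mamba (DM) connections in the last TSV stage: before each
    block, the running average of all features produced so far replaces the plain
    input, and the block adds its residual update,
    X_l = X_bar_{l-1} + VSS_l(X_bar_{l-1}) with X_bar_l = (1/l) * sum_i X_i."""

    def __init__(self, blocks):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)        # e.g. a list of VSSBlock modules

    def forward(self, x):
        feats = [x]                                # features collected so far
        for blk in self.blocks:
            x_bar = torch.stack(feats).mean(0)     # dense average of earlier features
            x = x_bar + blk(x_bar)                 # residual update on the averaged input
            feats.append(x)
        return x
```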

2.5. Consistent VMamba Fusion

To fully exploit the complementary information from different modalities, we introduce the Consistent VMamba Fusion (CVF). With well-aligned multi-modal images, we employ a consistency loss to align the deep features across modalities at the same spatial positions. Furthermore, we concatenate the aligned features along the channel dimension and feed them into the VSS blocks to achieve modality fusion. The detailed process of CVF is illustrated in the right corner of Figure 1. We first directly copy the features $F^{3}_{\text{RGB}}$, $F^{3}_{\text{NIR}}$, and $F^{3}_{\text{TIR}}$ to obtain $F_R$, $F_N$, and $F_T$, respectively. Then, we utilize a linear transformation to reduce the channel dimension of $F_R$, $F_N$, and $F_T$, yielding $F'_R, F'_N, F'_T \in \mathbb{R}^{\frac{H}{16} \times \frac{W}{16} \times \frac{C_3}{3}}$:

$F'_R = \text{Linear}(F_R),$
$F'_N = \text{Linear}(F_N),$
$F'_T = \text{Linear}(F_T).$

Then, we employ the consistency loss to align the features $F'_R$, $F'_N$, and $F'_T$ with the following equations:

$\mathcal{L}_{RN} = \|F'_R - F'_N\|_2^2,$
$\mathcal{L}_{RT} = \|F'_R - F'_T\|_2^2,$
$\mathcal{L}_{NT} = \|F'_N - F'_T\|_2^2.$

We end up with the consistency constraint loss $\mathcal{L}_C$, which can be represented as:

$\mathcal{L}_C = \frac{1}{N_p}(\mathcal{L}_{RN} + \mathcal{L}_{RT} + \mathcal{L}_{NT}),$

where $N_p$ is the number of patches. After obtaining the aligned features, we concatenate them along the channel dimension to obtain the multi-modal features $F_0 \in \mathbb{R}^{\frac{H}{16} \times \frac{W}{16} \times C_3}$ as follows:

$F_0 = [F'_R, F'_N, F'_T].$

Then, we feed $F_0$ into the stacked $K$ layers of VSS blocks:

$F_l = F_{l-1} + \text{VSS}_l(F_{l-1}), \quad l \in \{1, \ldots, K\},$

where $F_l$ denotes the output of the $l$-th VSS block in the CVF. Finally, we pool the output of the last VSS block to obtain the features

$f_{cvf} = P(F_K) \in \mathbb{R}^{C_3}.$

Subsequently, the features $f_{cvf}$ are sent for loss supervision. With the fully integrated multi-modal features, we can obtain more discriminative representations.
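The sketch below ties the CVF steps together: channel reduction, pairwise consistency loss, channel-wise concatenation, and fusion with stacked VSS blocks. The module name, the mean-pooling choice, and the normalization of the loss over patches are assumptions consistent with the description above, not the authors' code.

```python
import torch
import torch.nn as nn

class CVF(nn.Module):
    """Sketch of Consistent VMamba Fusion: reduce each modality from C3 to C3/3
    channels, penalize pairwise squared L2 differences between the aligned maps
    (consistency loss L_C), concatenate along channels (F_0), and fuse with K
    stacked VSS blocks before global pooling."""

    def __init__(self, c3, fuse_blocks):
        super().__init__()
        self.reduce = nn.ModuleList([nn.Linear(c3, c3 // 3) for _ in range(3)])
        self.fuse_blocks = nn.ModuleList(fuse_blocks)         # K blocks over C3 channels

    def forward(self, f_rgb, f_nir, f_tir):                   # each: (B, H/16, W/16, C3)
        fr, fn, ft = (lin(f) for lin, f in zip(self.reduce, (f_rgb, f_nir, f_tir)))
        n_patches = fr.shape[1] * fr.shape[2]
        loss_c = ((fr - fn).pow(2).sum() + (fr - ft).pow(2).sum()
                  + (fn - ft).pow(2).sum()) / n_patches       # L_C = (L_RN + L_RT + L_NT) / N_p
        f = torch.cat([fr, fn, ft], dim=-1)                   # F_0, back to C3 channels
        for blk in self.fuse_blocks:
            f = f + blk(f)                                    # F_l = F_{l-1} + VSS_l(F_{l-1})
        return f.mean(dim=(1, 2)), loss_c                     # f_cvf: (B, C3), plus L_C
```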

2.6. Objective Function

As depicted in Figure 1, our objective function comprises three parts: losses for the TSV backbone, losses for the CVF, and the consistency constraint loss $\mathcal{L}_C$. Both the backbone and CVF are supervised by the label-smoothing cross-entropy ID loss $\mathcal{L}_{\text{ID}}$ [30] and the triplet loss $\mathcal{L}_{\text{Triplet}}$ [31]:

$\mathcal{L}_{\text{TSV}} = \mathcal{L}_{\text{ID}}(f_{tsv}) + \mathcal{L}_{\text{Triplet}}(f_{tsv}),$
$\mathcal{L}_{\text{CVF}} = \mathcal{L}_{\text{ID}}(f_{cvf}) + \mathcal{L}_{\text{Triplet}}(f_{cvf}).$

The overall objective function is defined as:

$\mathcal{L} = \mathcal{L}_{\text{TSV}} + \mathcal{L}_{\text{CVF}} + \lambda \mathcal{L}_C,$

where $\lambda$ is the hyperparameter used to balance $\mathcal{L}_C$. By minimizing $\mathcal{L}$, our MambaReID can generate more discriminative multi-modal features with low computational complexity.
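For concreteness, a minimal sketch of how the three terms combine is shown below. The label-smoothing value and triplet margin are common ReID defaults assumed here, not values reported in the paper.

```python
import torch.nn as nn

# Assumed standard components for L_ID and L_Triplet.
id_loss = nn.CrossEntropyLoss(label_smoothing=0.1)   # L_ID
triplet_loss = nn.TripletMarginLoss(margin=0.3)      # L_Triplet

def total_loss(logits_tsv, logits_cvf, trip_tsv, trip_cvf, labels, loss_c, lam=1.0):
    """L = L_TSV + L_CVF + lambda * L_C, with each branch supervised by ID + triplet.
    trip_* are (anchor, positive, negative) feature triples mined from the batch."""
    l_tsv = id_loss(logits_tsv, labels) + triplet_loss(*trip_tsv)
    l_cvf = id_loss(logits_cvf, labels) + triplet_loss(*trip_cvf)
    return l_tsv + l_cvf + lam * loss_c
```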

3. Experiments

3.1. Dataset and Experimental Setup

We utilize three multi-modal object ReID benchmarks to evaluate the performance of our proposed MambaReID. Particularly, RGBNT201 [4] is the inaugural multi-modal dataset for person re-identification, which includes RGB, NIR, and TIR modalities. RGBNT100 [5] is a large-scale multi-modal vehicle ReID dataset, while MSVR310 [32] is a smaller-scale multi-modal vehicle ReID dataset featuring complex visual scenes. Regarding the evaluation metrics, we align with prior studies and adopt mean Average Precision (mAP) and Cumulative Matching Characteristics (CMC) at Rank-K ($K = 1, 5, 10$). Additionally, we report the trainable parameters and FLOPs for complexity analysis.

3.2. Implementation Details

We leverage a pre-trained VMamba [19], sourced from the ImageNet classification dataset [33], as the backbone of our architecture. The images of RGBNT201 are resized to 256 × 128, and the images of the RGBNT100 and MSVR310 datasets are resized to 128 × 256 during data processing. In the training process, we enhance the robustness of our model by applying data augmentation techniques such as random horizontal flipping, cropping, and erasing [34]. The training process uses a mini-batch size of 64, consisting of eight randomly selected identities, each providing eight images. Optimization is performed with Stochastic Gradient Descent (SGD) with a momentum of 0.9 and a weight decay of 0.0001. The learning rate is initially set to 0.01, and a cosine decay warm-up strategy is applied. The hyperparameter $\lambda$ in $\mathcal{L}$ is set to 1. For the CVF, we set $K$ to 1. During testing, we concatenate the features $f_{tsv}$ and $f_{cvf}$ to obtain the final multi-modal features for retrieval. Specifically, each query consists of three paired images: RGB, NIR, and TIR. The model takes these three modalities as input and extracts the final retrieval vector according to the network structure, which serves as the feature of the query. A similar process is applied to each triplet in the gallery. Finally, by ranking the similarity between the features of the query and those in the gallery, the model determines whether the object is correctly matched. The proposed method is implemented using the PyTorch framework, and experiments are conducted on a single NVIDIA A100 GPU (NVIDIA, Santa Clara, CA, USA).
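A small sketch of the test-time retrieval procedure described above follows. The forward signature returning the two feature vectors and the cosine-similarity ranking are assumptions for illustration; the paper does not specify its distance metric.

```python
import torch
import torch.nn.functional as F

def extract_descriptor(model, rgb, nir, tir):
    """Run the three aligned images through the network and concatenate the TSV
    and CVF features into one retrieval vector (forward signature assumed)."""
    f_tsv, f_cvf = model(rgb, nir, tir)
    return torch.cat([f_tsv, f_cvf], dim=-1)

def rank_gallery(query_feat, gallery_feats):
    """Rank gallery triplets by cosine similarity to the query descriptor."""
    q = F.normalize(query_feat, dim=-1)              # (D,)
    g = F.normalize(gallery_feats, dim=-1)           # (num_gallery, D)
    return torch.argsort(g @ q, descending=True)     # indices, most similar first
```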

3.3. Comparisons With State-of-the-Art Methods

Multi-modal Person ReID. As reported in Table 1a, we conduct comprehensive comparisons with dominant single-modal and multi-modal approaches on RGBNT201. Typically, single-modal approaches perform poorly due to the lack of specialized designs for multi-modal fusion. Among single-modal methods, PCB demonstrates a notable mAP of 32.8%. For multi-modal methods, CNN-based frameworks demonstrate inferior performance compared to Transformer-based ones due to their limited receptive fields. With the strong generalization ability of Transformers, HTT [14], TOP-ReID [7] and EDITOR [8] achieve superior performance. Specifically, TOP-ReID (A) achieves a mAP of 72.3%, surpassing HTT by 1.2%. However, the high computational complexity of Transformers is unacceptable. Based on VMamba, with only 18.32% of the TOP-ReID’s parameters, our MambaReID achieves a mAP of 72.2%. Additionally, our MambaReID outperforms most CNN-based and Transformer-based methods across various settings (A/B settings), thereby validating the efficacy of our approach.
Multi-modal Vehicle ReID. As depicted in Table 1b, we compare our MambaReID with mainstream methods on RGBNT100 and MSVR310. Among single-modal methods, BoT [22] achieves a mAP of 78.0% on RGBNT100. The Transformer-based method TransReID [45] underperforms CNN-based methods due to the lack of convolutional inductive bias; especially on the small-scale dataset MSVR310, TransReID achieves a mAP of only 18.4%, which is 10.5% lower than AGW [1]. In contrast, among multi-modal methods, Transformer-based methods like TOP-ReID [7] and EDITOR [8] exhibit superior performance in integrating multi-modal information. Focusing on mitigating the influence of irrelevant background, EDITOR achieves a mAP of 82.1% on RGBNT100. With only 50.2% of EDITOR's trainable parameters, our MambaReID achieves a competitive mAP of 78.6% on RGBNT100. Additionally, on MSVR310, our MambaReID (B) achieves a mAP of 46.1%, surpassing EDITOR (B) by 7.1%. Overall, these results fully demonstrate the effectiveness of our proposed method.

3.4. Ablation Study

Effect of different components. In Table 2, we conduct ablation studies to evaluate the effectiveness of different components on RGBNT201. Model A is the baseline model with only TSV. With the introduction of CVF, model B achieves a mAP of 68.03%, surpassing model A by 4.22%. Through introducing DM, model C achieves a mAP of 68.07%, surpassing model A by 4.26%. Finally, with both DM and CVF, our MambaReID achieves a mAP of 72.20%, surpassing model B and model C by 4.17% and 4.13%, respectively. As for complexity, models B and C introduce only a small computational overhead; for model C in particular, the overhead is negligible. These results fully validate the effectiveness of our proposed components.
Parameter analysis. In Table 3, we compare the trainable parameters of our MambaReID with mainstream methods on RGBNT100. Compared to TOP-ReID [7] and EDITOR [8], our MambaReID achieves a competitive mAP of 78.6% on RGBNT100 with only 59.47 M parameters. Additionally, our MambaReID outperforms most CNN-based methods with a similar number of parameters. This clearly illustrates the efficiency and effectiveness of our proposed approach.
Effect of different backbones. In Table 4, we compare the performance of different backbones on RGBNT201. ViT achieves a mAP of 63.18%, while directly using VMamba yields a mAP of only 55.98%. This result indicates that the final-stage downsampling of VMamba leads to substantial detail loss. With the last-stride trick from BoT [22], VMamba improves to a mAP of 58.47%, but at the cost of a considerable increase in FLOPs. Thus, we directly drop the final stage of VMamba and adopt the last-stride technique, which yields our TSV. With only 63.96% of the parameters and 79.22% of the FLOPs of ViT, TSV achieves a mAP of 63.81%, surpassing ViT by 0.63%. Hence, we employ TSV as our backbone in subsequent experiments. These outcomes affirm the efficacy of our proposed backbone structure.
Effect of different depths of CVF. In Table 5, we evaluate the performance of different depths of CVF on RGBNT201. With the increase in depth, the performance of CVF gradually decreases. This result indicates that a deeper CVF may introduce more irrelevant information, leading to performance degradation. Hence, we opt to use a CVF with a depth of 1 in the other experiments.
Effect of different DMs. In Table 6, we evaluate the performance of different DM settings on RGBNT201. Specifically, “Last” indicates whether the output of TSV is taken as the output of the last block alone or as the sum of all blocks in the last stage, and “Freq” denotes how frequently a dense connection is introduced in the last stage; when “Freq” is set to 2, dense connections are introduced every two blocks. Comparing model A and model B, integrating the outputs of all blocks in the last stage introduces more fine-grained details. Additionally, with “Freq” set to 2, model C achieves a mAP of 66.04%, surpassing model A by 2.53%. Finally, model D achieves the best performance with a mAP of 68.07%.

3.5. Visualization Analysis

In Figure 3, we visualize the feature distributions of different modules on RGBNT201. In Figure 3a, the original VMamba’s extracted features for the same ID are widely dispersed, resulting in poor discrimination. Comparing Figure 3a,b, the introduction of TSV leads to tighter feature grouping within IDs and increased dispersion between different IDs. Additionally, the inclusion of DM results in a more compact feature distribution in Figure 3c. Comparing Figure 3d with Figure 3c, the introduction of CVF increases the spacing between indistinguishable IDs while reducing intra-ID spacing. These visualizations provide strong evidence of the efficacy and superiority of our modules.

4. Conclusions

In this study, we introduce a novel fusion framework, MambaReID, designed for multi-modal object ReID. We are the first to explore the potential of Mamba in multi-modal object ReID, and we find that its final stage critically disrupts the ReID features. Thus, MambaReID utilizes a Three-Stage VMamba (TSV) architecture to derive robust multi-modal representations that are rich in detail yet require lower computational resources. To enhance its discriminative power, we incorporate a Dense Mamba (DM) module within the TSV to fully exploit various levels of semantic features. Furthermore, leveraging well-aligned multi-modal images, we implement a Consistent VMamba Fusion (CVF) technique aimed at refining the granularity of modal integration. Comprehensive testing across three public benchmarks validates the superior performance and efficiency of our framework.

Author Contributions

Conceptualization, R.Z.; Methodology, L.X.; Validation, L.W.; Investigation, S.Y.; Writing—review & editing, R.Z. and L.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Science Foundation of Jiang Su Higher Education Institutions, grant number 24KJD510005 and Jiang Su Province Industry-University-Research Project, grant number BY20230694.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ye, M.; Shen, J.; Lin, G.; Xiang, T.; Shao, L.; Hoi, S. Deep learning for person re-identification: A survey and outlook. TPAMI 2021, 44, 2872–2893. [Google Scholar] [CrossRef]
  2. Ye, M.; Chen, S.; Li, C.; Zheng, W.; Crandall, D.; Du, B. Transformer for Object Re-Identification: A Survey. arXiv 2024, arXiv:2401.06960. [Google Scholar]
  3. Amiri, A.; Kaya, A.; Keceli, A. A Comprehensive Survey on Deep-Learning-based Vehicle Re-Identification: Models, Data Sets and Challenges. arXiv 2024, arXiv:2401.10643. [Google Scholar]
  4. Zheng, A.; Wang, Z.; Chen, Z.; Li, C.; Tang, J. Robust multi-modality person re-identification. Proc. AAAI Conf. Artif. Intell. 2021, 35, 3529–3537. [Google Scholar] [CrossRef]
  5. Li, H.; Li, C.; Zhu, X.; Zheng, A.; Luo, B. Multi-spectral vehicle re-identification: A challenge. Proc. AAAI Conf. Artif. Intell. 2020, 34, 11345–11353. [Google Scholar] [CrossRef]
  6. Zheng, A.; He, Z.; Wang, Z.; Li, C.; Tang, J. Dynamic Enhancement Network for Partial Multi-modality Person Re-identification. arXiv 2023, arXiv:2305.15762. [Google Scholar]
  7. Wang, Y.; Liu, X.; Zhang, P.; Lu, H.; Tu, Z.; Lu, H. TOP-ReID: Multi-spectral Object Re-Identification with Token Permutation. Proc. AAAI Conf. Artif. Intell. 2024, 38, 5758–5766. [Google Scholar] [CrossRef]
  8. Zhang, P.; Wang, Y.; Liu, Y.; Tu, Z.; Lu, H. Magic Tokens: Select Diverse Tokens for Multi-modal Object Re-Identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024. [Google Scholar]
  9. Wang, Z.; Li, C.; Zheng, A.; He, R.; Tang, J. Interact, embed, and enlarge: Boosting modality-specific representations for multi-modal person re-identification. Proc. AAAI Conf. Artif. Intell. 2022, 36, 2633–2641. [Google Scholar] [CrossRef]
  10. Guo, J.; Zhang, X.; Liu, Z.; Wang, Y. Generative and attentive fusion for multi-spectral vehicle re-identification. In Proceedings of the 2022 7th International Conference on Intelligent Computing and Signal Processing (ICSP), Xi’an, China, 15–17 April 2022; pp. 1565–1572. [Google Scholar]
  11. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. NIPS 2017, 30, 6000–6010. [Google Scholar]
  12. Pan, W.; Huang, L.; Liang, J.; Hong, L.; Zhu, J. Progressively Hybrid Transformer for Multi-Modal Vehicle Re-Identification. Sensors 2023, 23, 4206. [Google Scholar] [CrossRef] [PubMed]
  13. Crawford, J.; Yin, H.; McDermott, L.; Cummings, D. UniCat: Crafting a Stronger Fusion Baseline for Multimodal Re-Identification. arXiv 2023, arXiv:2310.18812. [Google Scholar]
  14. Wang, Z.; Huang, H.; Zheng, A.; He, R. Heterogeneous Test-Time Training for Multi-Modal Person Re-identification. Proc. AAAI Conf. Artif. Intell. 2024, 38, 5850–5858. [Google Scholar] [CrossRef]
  15. Lu, Z.; Xie, H.; Liu, C.; Zhang, Y. Bridging the gap between vision transformers and convolutional neural networks on small datasets. Adv. Neural Inf. Process. Syst. 2022, 35, 14663–14677. [Google Scholar]
  16. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar]
  17. Yang, Y.; Xing, Z.; Zhu, L. Vivim: A Video Vision Mamba for Medical Video Object Segmentation. arXiv 2024, arXiv:2401.14168. [Google Scholar]
  18. Smith, J.T.; Warrington, A.; Linderman, S.W. Simplified state space layers for sequence modeling. arXiv 2022, arXiv:2208.04933. [Google Scholar]
  19. Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Liu, Y. Vmamba: Visual state space model. arXiv 2024, arXiv:2401.10166. [Google Scholar]
  20. Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision mamba: Efficient visual representation learning with bidirectional state space model. arXiv 2024, arXiv:2401.09417. [Google Scholar]
  21. Wan, Z.; Wang, Y.; Yong, S.; Zhang, P.; Stepputtis, S.; Sycara, K.; Xie, Y. Sigma: Siamese Mamba Network for Multi-Modal Semantic Segmentation. arXiv 2024, arXiv:2404.04256. [Google Scholar]
  22. Luo, H.; Gu, Y.; Liao, X.; Lai, S.; Jiang, W. Bag of tricks and a strong baseline for deep person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  23. Zhao, B.; Feng, J.; Wu, X.; Yan, S. A survey on deep learning-based fine-grained object classification and semantic segmentation. Int. J. Autom. Comput. 2017, 14, 119–135. [Google Scholar] [CrossRef]
  24. Gu, A.; Goel, K.; Ré, C. Efficiently modeling long sequences with structured state spaces. arXiv 2021, arXiv:2111.00396. [Google Scholar]
  25. Gu, A.; Johnson, I.; Goel, K.; Saab, K.; Dao, T.; Rudra, A.; Ré, C. Combining recurrent, convolutional, and continuous-time models with linear state space layers. Adv. Neural Inf. Process. Syst. 2021, 34, 572–585. [Google Scholar]
  26. Ba, J.; Kiros, J.; Hinton, G. Layer normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar]
  27. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar]
  28. Elfwing, S.; Uchibe, E.; Doya, K. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Netw. 2018, 107, 3–11. [Google Scholar] [CrossRef] [PubMed]
  29. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
  30. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar]
  31. Hermans, A.; Beyer, L.; Leibe, B. In defense of the triplet loss for person re-identification. arXiv 2017, arXiv:1703.07737. [Google Scholar]
  32. Zheng, A.; Zhu, X.; Ma, Z.; Li, C.; Tang, J.; Ma, J. Multi-spectral vehicle re-identification with cross-directional consistency network and a high-quality benchmark. arXiv 2022, arXiv:2208.00632. [Google Scholar]
  33. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 248–255. [Google Scholar]
  34. Zhong, Z.; Zheng, L.; Kang, G.; Li, S.; Yang, Y. Random erasing data augmentation. Proc. AAAI Conf. Artif. Intell. 2020, 34, 13001–13008. [Google Scholar] [CrossRef]
  35. Qian, X.; Fu, Y.; Jiang, Y.G.; Xiang, T.; Xue, X. Multi-scale deep learning architectures for person re-identification. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 5399–5408. [Google Scholar]
  36. Li, W.; Zhu, X.; Gong, S. Harmonious attention network for person re-identification. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 2285–2294. [Google Scholar]
  37. Chang, X.; Hospedales, T.M.; Xiang, T. Multi-level factorisation net for person re-identification. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 2109–2118. [Google Scholar]
  38. Sun, Y.; Zheng, L.; Yang, Y.; Tian, Q.; Wang, S. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In Proceedings of the ECCV, Munich, Germany, 8–14 September 2018; pp. 480–496. [Google Scholar]
  39. Zhou, K.; Yang, Y.; Cavallaro, A.; Xiang, T. Omni-scale feature learning for person re-identification. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3702–3712. [Google Scholar]
  40. Rao, Y.; Chen, G.; Lu, J.; Zhou, J. Counterfactual attention learning for fine-grained visual categorization and re-identification. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 1025–1034. [Google Scholar]
  41. Wang, G.; Yuan, Y.; Chen, X.; Li, J.; Zhou, X. Learning discriminative features with multiple granularities for person re-identification. In Proceedings of the MM ’18: Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Republic of Korea, 22–26 October 2018; pp. 274–282. [Google Scholar]
  42. Chen, G.; Zhang, T.; Lu, J.; Zhou, J. Deep meta metric learning. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9547–9556. [Google Scholar]
  43. Sun, Y.; Cheng, C.; Zhang, Y.; Zhang, C.; Zheng, L.; Wang, Z.; Wei, Y. Circle loss: A unified perspective of pair similarity optimization. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 6398–6407. [Google Scholar]
  44. Zhao, J.; Zhao, Y.; Li, J.; Yan, K.; Tian, Y. Heterogeneous relational complement for vehicle re-identification. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 205–214. [Google Scholar]
  45. He, S.; Luo, H.; Wang, P.; Wang, F.; Li, H.; Jiang, W. Transreid: Transformer-based object re-identification. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 15013–15022. [Google Scholar]
  46. Yin, H.; Li, J.; Schiller, E.; McDermott, L.; Cummings, D. GraFT: Gradual Fusion Transformer for Multimodal Re-Identification. arXiv 2023, arXiv:2310.16856. [Google Scholar]
  47. He, Q.; Lu, Z.; Wang, Z.; Hu, H. Graph-Based Progressive Fusion Network for Multi-Modality Vehicle Re-Identification. IEEE Trans. Intell. Transp. Syst. 2023, 24, 12431–12447. [Google Scholar] [CrossRef]
  48. Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
Figure 2. The visualization of the 2D Selective Scan (SS2D) mechanism based on TIR images. From left to right, SS2D scans the input image in four directions for comprehensive information perception. The red arrows indicate the different scanning directions.
Figure 3. Feature distributions with t-SNE [48]. Different colors represent different IDs.
Table 1. Comparative analysis of three multi-modal object ReID benchmarks. The symbol * represents Transformer-based methods, while † denotes Mamba-based methods; the remaining methods are CNN-based. A and B denote the different settings from TOP-ReID [7].

(a) Comparison on RGBNT201

| Modality | Venue | Model | mAP | R-1 | R-5 | R-10 |
|---|---|---|---|---|---|---|
| Single | ICCV'17 | MUDeep [35] | 23.8 | 19.7 | 33.1 | 44.3 |
| | CVPR'18 | HACNN [36] | 21.3 | 19.0 | 34.1 | 42.8 |
| | CVPR'18 | MLFN [37] | 26.1 | 24.2 | 35.9 | 44.1 |
| | ECCV'18 | PCB [38] | 32.8 | 28.1 | 37.4 | 46.9 |
| | ICCV'19 | OSNet [39] | 25.4 | 22.3 | 35.1 | 44.7 |
| | ICCV'21 | CAL [40] | 27.6 | 24.3 | 36.5 | 45.7 |
| Multi | AAAI'20 | HAMNet [5] | 27.7 | 26.3 | 41.5 | 51.7 |
| | AAAI'21 | PFNet [4] | 38.5 | 38.9 | 52.0 | 58.4 |
| | AAAI'22 | IEEE [9] | 49.5 | 48.4 | 59.1 | 65.6 |
| | ArXiv'23 | DENet [6] | 42.4 | 42.2 | 55.3 | 64.5 |
| | NeurIPSW'23 | UniCat * [13] | 57.0 | 55.7 | - | - |
| | AAAI'24 | HTT * [14] | 71.1 | 73.4 | 83.1 | 87.3 |
| | AAAI'24 | TOP-ReID (A) * [7] | 72.3 | 76.6 | 84.7 | 89.4 |
| | AAAI'24 | TOP-ReID (B) * [7] | 64.6 | 64.6 | 77.4 | 82.4 |
| | CVPR'24 | EDITOR (A) * [8] | 66.5 | 68.3 | 81.1 | 88.2 |
| | CVPR'24 | EDITOR (B) * [8] | 65.7 | 68.8 | 82.5 | 89.1 |
| | Ours | MambaReID (A) † | 72.2 | 76.0 | 84.0 | 89.0 |
| | Ours | MambaReID (B) † | 65.1 | 67.4 | 78.4 | 84.4 |

(b) Comparison on RGBNT100 and MSVR310

| Modality | Venue | Model | RGBNT100 mAP | RGBNT100 R-1 | MSVR310 mAP | MSVR310 R-1 |
|---|---|---|---|---|---|---|
| Single | ECCV'18 | PCB [38] | 57.2 | 83.5 | 23.2 | 42.9 |
| | ACM MM'18 | MGN [41] | 58.1 | 83.1 | 26.2 | 44.3 |
| | ICCV'19 | DMML [42] | 58.5 | 82.0 | 19.1 | 31.1 |
| | CVPRW'19 | BoT [22] | 78.0 | 95.1 | 23.5 | 38.4 |
| | ICCV'19 | OSNet [39] | 75.0 | 95.6 | 28.7 | 44.8 |
| | CVPR'20 | Circle Loss [43] | 59.4 | 81.7 | 22.7 | 34.2 |
| | ICCV'21 | HRCN [44] | 67.1 | 91.8 | 23.4 | 44.2 |
| | TPAMI'21 | AGW [1] | 73.1 | 92.7 | 28.9 | 46.9 |
| | ICCV'21 | TransReID * [45] | 75.6 | 92.9 | 18.4 | 29.6 |
| Multi | AAAI'20 | HAMNet [5] | 74.5 | 93.3 | 27.1 | 42.3 |
| | AAAI'21 | PFNet [4] | 68.1 | 94.1 | 23.5 | 37.4 |
| | ICSP'22 | GAFNet [10] | 74.4 | 93.4 | - | - |
| | Inform Fusion'22 | CCNet [32] | 77.2 | 96.3 | 36.4 | 55.2 |
| | ArXiv'23 | GraFT * [46] | 76.6 | 94.3 | - | - |
| | TITS'23 | GPFNet [47] | 75.0 | 94.5 | - | - |
| | Sensors'23 | PHT * [12] | 79.9 | 92.7 | - | - |
| | NeurIPSW'23 | UniCat * [13] | 79.4 | 96.2 | - | - |
| | AAAI'24 | HTT * [14] | 75.7 | 92.6 | - | - |
| | AAAI'24 | TOP-ReID (A) * [7] | 73.7 | 92.2 | 30.2 | 33.7 |
| | AAAI'24 | TOP-ReID (B) * [7] | 81.2 | 96.4 | 35.9 | 44.6 |
| | CVPR'24 | EDITOR (A) * [8] | 79.8 | 93.9 | 35.8 | 43.1 |
| | CVPR'24 | EDITOR (B) * [8] | 82.1 | 96.4 | 39.0 | 49.3 |
| | Ours | MambaReID (A) † | 76.6 | 93.4 | 35.8 | 46.0 |
| | Ours | MambaReID (B) † | 78.6 | 94.3 | 46.1 | 59.4 |
Table 2. Comparative performance of different components on RGBNT201.

| Model | DM | CVF | Params (M) | FLOPs (G) | mAP | R-1 | R-5 | R-10 |
|---|---|---|---|---|---|---|---|---|
| A | | | 55.67 | 26.99 | 63.81 | 67.57 | 77.97 | 83.77 |
| B | | ✓ | 59.47 | 27.56 | 68.03 | 70.96 | 79.88 | 84.69 |
| C | ✓ | | 55.67 | 26.99 | 68.07 | 72.38 | 82.70 | 86.96 |
| D | ✓ | ✓ | 59.47 | 27.56 | 72.20 | 75.96 | 83.97 | 89.00 |
Table 3. Comparative number of parameters for different methods on RGBNT100. The symbol * represents Transformer-based methods, while † denotes Mamba-based methods; the remaining methods are CNN-based.

| Venue | Model | Params (M) | mAP | Rank-1 |
|---|---|---|---|---|
| ECCV'18 | PCB [38] | 72.33 | 57.2 | 83.5 |
| ICCV'19 | OSNet [39] | 7.02 | 75.0 | 95.6 |
| AAAI'20 | HAMNet [5] | 78.00 | 74.5 | 93.3 |
| Inform Fusion'22 | CCNet [32] | 74.60 | 77.2 | 96.3 |
| ICSP'22 | GAFNet [10] | 130.00 | 74.4 | 93.4 |
| ICCV'21 | TransReID * [45] | 278.23 | 75.6 | 92.9 |
| NeurIPSW'23 | UniCat * [13] | 259.02 | 79.4 | 96.2 |
| ArXiv'23 | GraFT * [46] | 101.00 | 76.6 | 94.3 |
| AAAI'24 | TOP-ReID * [7] | 324.53 | 81.2 | 96.4 |
| CVPR'24 | EDITOR * [8] | 118.55 | 82.1 | 96.4 |
| Ours | MambaReID † | 59.47 | 78.6 | 94.3 |
Table 4. Performance comparison with different baselines on RGBNT201. The symbol ‡ denotes the backbone with the last stride set to 1.

| Backbone | Params (M) | FLOPs (G) | mAP | Rank-1 | Rank-5 | Rank-10 |
|---|---|---|---|---|---|---|
| ViT | 87.04 | 34.07 | 63.18 | 64.58 | 77.91 | 84.56 |
| VMamba | 88.06 | 30.09 | 55.98 | 57.85 | 71.60 | 78.65 |
| VMamba ‡ | 88.06 | 39.38 | 58.47 | 60.92 | 74.04 | 80.72 |
| TSV | 55.67 | 26.99 | 63.81 | 67.57 | 77.97 | 83.77 |
Table 5. Comparison of different depths of CVF on RGBNT201.

| Depth | Params (M) | FLOPs (G) | mAP | Rank-1 | Rank-5 | Rank-10 |
|---|---|---|---|---|---|---|
| 1 | 59.47 | 27.56 | 68.03 | 70.96 | 79.88 | 84.69 |
| 2 | 62.75 | 28.08 | 66.11 | 67.89 | 78.04 | 83.74 |
| 3 | 66.02 | 28.60 | 64.83 | 66.96 | 77.98 | 84.03 |
| 4 | 69.30 | 29.12 | 63.01 | 65.72 | 76.52 | 82.33 |
| 5 | 72.58 | 29.64 | 65.07 | 68.18 | 78.18 | 83.82 |
| 6 | 75.85 | 30.16 | 62.42 | 64.58 | 75.01 | 80.49 |
Table 6. Comparison of different DM settings on RGBNT201. “Last” indicates whether the output of the last stage is taken as the last block alone or as the sum of all blocks; “Freq” denotes how often a dense connection is inserted.

| Model | Last | Freq | Params (M) | FLOPs (G) | mAP | R-1 | R-5 | R-10 |
|---|---|---|---|---|---|---|---|---|
| A | last block | 1 | 55.67 | 26.99 | 63.51 | 68.03 | 78.94 | 84.63 |
| B | all blocks | 1 | 55.67 | 26.99 | 66.63 | 70.98 | 81.35 | 86.23 |
| C | last block | 2 | 55.67 | 26.99 | 66.04 | 70.51 | 81.02 | 86.18 |
| D | all blocks | 2 | 55.67 | 26.99 | 68.07 | 72.38 | 82.70 | 86.96 |
