Review

A Survey on Visual Mamba

by Hanwei Zhang 1,2,3, Ying Zhu 4, Dan Wang 4, Lijun Zhang 1, Tianxiang Chen 5, Ziyang Wang 6 and Zi Ye 2,*

1 Automotive Software Innovation Center, Chongqing 401331, China
2 Institute of Intelligent Software, Guangzhou 511458, China
3 Department of Computer Science, Saarland University, 66424 Homburg, Germany
4 Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Hangzhou 310024, China
5 School of Cyber Space and Technology, University of Science and Technology of China, Hefei 230026, China
6 Department of Computer Science, University of Oxford, Oxford OX3 7LD, UK
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(13), 5683; https://doi.org/10.3390/app14135683
Submission received: 22 May 2024 / Revised: 17 June 2024 / Accepted: 27 June 2024 / Published: 28 June 2024
(This article belongs to the Special Issue Application of Artificial Intelligence in Visual Processing)

Abstract: State space models (SSM) with selection mechanisms and hardware-aware architectures, namely Mamba, have recently shown significant potential in long-sequence modeling. Since the complexity of transformers’ self-attention mechanism is quadratic with image size and entails increasing computational demands, researchers are currently exploring how to adapt Mamba for computer vision tasks. This paper is the first comprehensive survey that aims to provide an in-depth analysis of Mamba models within the domain of computer vision. It begins by exploring the foundational concepts contributing to Mamba’s success, including the SSM framework, selection mechanisms, and hardware-aware design. Then, we review these vision Mamba models by categorizing them into foundational models and those enhanced with techniques including convolution, recurrence, and attention to improve their sophistication. Furthermore, we investigate the widespread applications of Mamba in vision tasks, which include their use as a backbone in various levels of vision processing. This encompasses general visual tasks, medical visual tasks (e.g., 2D/3D segmentation, classification, image registration, etc.), and remote sensing visual tasks. In particular, we introduce general visual tasks from two levels: high/mid-level vision (e.g., object detection, segmentation, video classification, etc.) and low-level vision (e.g., image super-resolution, image restoration, visual generation, etc.). We hope this endeavor will spark additional interest within the community to address current challenges and further apply Mamba models in computer vision.

1. Introduction

Deep neural networks have exhibited remarkable performance across various artificial intelligence tasks, with the fundamental architecture playing a crucial role in determining a model’s capabilities. Typically, traditional neural networks comprise multi-layer perceptron (MLP) or fully connected layers [1,2]. Convolutional neural networks (CNNs) [3,4] introduce convolutional and pooling layers, which are particularly effective for processing shift-invariant data such as images. Recurrent neural networks (RNNs) [5,6] utilize recurrent cells to handle sequential or time series data. To address the limitation of CNN, RNN, and graph neural network models in capturing only local relationships, the transformer [7,8,9], introduced in 2017, excels at learning long-distance feature representations. Transformers primarily depend on attention mechanisms, e.g., self-attention and cross-attention, to extract intrinsic features and enhance their representation capability. Pre-trained massive transformer-based models, such as GPT-3 [10], deliver robust performance across various natural language processing datasets, excelling in tasks involving the generation and comprehension of natural language. The remarkable performance of transformer-based models has led to their extensive adoption in vision applications. The key to transformer models is their exceptional skill in capturing long-range dependencies, as well as maximizing the use of large datasets. The feature extraction module is the primary component of vision transformer architectures; it processes data using a sequence of self-attention blocks, which markedly enhances its capacity to analyze images.
Nevertheless, a primary obstacle for transformers is the substantial computational demand of the self-attention mechanism, which increases quadratically with image resolution. The Softmax operation within the attention blocks further intensifies these computational demands, presenting significant challenges for deploying such models on edge and low-resource devices. Apart from that, real-time computer vision systems utilizing transformer-based models must adhere to stringent low-latency standards in order to maintain a high-quality user experience. This scenario emphasizes the continuous evolution of new architectures to enhance performance, although this usually comes with the trade-off of higher computational demands. Numerous new models based on sparse attention mechanisms or innovative neural network paradigms have been put forward to further lower computational costs, while capturing long-range dependencies and maintaining high performance. SSMs have become a central focus among these developments. As displayed in Figure 1a, the number of publications related to SSMs demonstrates an explosive growth trend. Initially devised to simulate dynamic systems in areas including control theory and computational neuroscience using state variables, SSMs adapted for deep learning predominantly describe linear time-invariant (or stable) systems.
As SSMs have evolved, a new class of selective SSMs has emerged, termed Mamba [11]. These have advanced the modeling of discrete data, such as text, with SSMs through two key improvements. Firstly, they feature an input-dependent mechanism for adjusting SSM parameters dynamically, enhancing information filtering. Secondly, Mamba employs a hardware-aware approach that processes data linearly with sequence length, boosting computational speed on modern systems. Inspired by Mamba’s achievements in language modeling, several initiatives are currently aiming to transfer this success to the field of vision. Several studies have explored its integration with mixture-of-experts (MoE) techniques, as demonstrated by works such as Jamba [12], MoE-Mamba [13], and BlackMamba [14], which outperform state-of-the-art transformer-MoE architectures with fewer training steps. As depicted in Figure 1b, since the release of Mamba in December 2023, the number of research papers focusing on Mamba in the vision domain has rapidly increased, reaching a peak in March 2024. This trend suggests that Mamba is emerging as a prominent research area in vision, potentially providing a viable alternative to transformers. Therefore, a review of the current related works is necessary and timely, providing a detailed overview of this new methodology in this evolving field.
Consequently, we present a comprehensive overview of how Mamba models are used in the vision domain. This paper aims to serve as a guide for researchers looking to delve deeper into this area. The critical contributions of our work include:
  • This survey is the first attempt to offer an in-depth analysis of the Mamba technique in the vision domain, explicitly concentrating on analyzing the proposed strategies.
  • An investigation of how Mamba’s capabilities can be enhanced and combined with other architectures to achieve superior performance, expanding upon the naive Mamba visual framework.
  • We offer an exploration that organizes the literature based on various application tasks. In addition, we establish a taxonomy, identify advancements specific to each task, and offer insights on overcoming challenges.
  • To keep up with the rapid development in this field, we will regularly update this review with the latest relevant papers and develop an open-source implementation at https://github.com/ziyangwang007/Awesome-Visual-Mamba (accessed on 25 June 2024).
Here is the structure for the remaining portions of the survey. Section 2 examines the general and mathematical concepts underlying Mamba strategies. Section 3 discusses the naive Mamba visual models and how they integrate with other technologies to enhance performance, as recently proposed. Section 4 explores the application of Mamba technologies in addressing a variety of computer vision tasks. Finally, Section 5 concludes the survey.

2. Formulation of Mamba

Mamba [11] was initially introduced in the domain of natural language processing. As depicted in Figure 2, the original Mamba block integrated a gated MLP into the SSM architecture of H3 [15], utilizing an SSM sandwiched between two gated connections alongside a standard local convolution. For the activation function $\sigma$, the SiLU [16] or Swish [17] activation is used. The Mamba architecture consists of repeated Mamba blocks interspersed with residual connections and standard normalization. An optional normalization layer (LayerNorm [18], chosen by the original Mamba) is applied in a similar location to RetNet [19].

2.1. State Space Model

Consider a structured SSM mapping a one-dimensional sequence $x(t) \in \mathbb{R}^L$ to $y(t) \in \mathbb{R}^L$ through a hidden state $h(t) \in \mathbb{R}^N$. With the evolution parameter $A \in \mathbb{R}^{N \times N}$ and the projection parameters $B \in \mathbb{R}^{N \times 1}$, $C \in \mathbb{R}^{1 \times N}$, such a model is formulated using the linear ordinary differential equations

$$ h'(t) = A h(t) + B x(t), \qquad y(t) = C h(t). \quad (1) $$

2.1.1. Discretization

To adapt to deep learning, SSMs, as continuous-time models, are discretized under a zero-order hold (ZOH) assumption. Therefore, the continuous-time parameters $A, B$ are transformed into their discretized counterparts $\bar{A}, \bar{B}$ with a timescale parameter $\Delta$ according to

$$ \bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1} \left( \exp(\Delta A) - I \right) \cdot \Delta B. \quad (2) $$

Thus, Equation (1) can be rewritten as

$$ h_t = \bar{A} h_{t-1} + \bar{B} x_t, \qquad y_t = C h_t. \quad (3) $$

To facilitate understanding of this discretization, we have illustrated it visually in Figure 3. To enhance computational efficiency and scalability, the iterative process in Equation (3) can be synthesized through a global convolution

$$ \bar{K} = \left( C\bar{B},\, C\bar{A}\bar{B},\, \ldots,\, C\bar{A}^{L-1}\bar{B} \right), \qquad y = x * \bar{K}, \quad (4) $$

where $L$ is the length of the input sequence $x$, $\bar{K} \in \mathbb{R}^L$ serves as the kernel of the SSM, and $*$ represents the convolution operation.
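To make the discretization concrete, the following minimal NumPy/SciPy sketch (our illustration with arbitrary toy dimensions, not code from any of the surveyed papers) applies the ZOH formulas of Equation (2) to a random stable system and checks that the recurrent form of Equation (3) and the convolutional form of Equation (4) produce identical outputs.

```python
import numpy as np
from scipy.linalg import expm

# Toy sizes: state dimension N and sequence length L (illustrative values only).
N, L = 4, 16
rng = np.random.default_rng(0)

A = -np.diag(rng.uniform(0.5, 1.5, size=N))   # stable continuous-time evolution matrix
B = rng.standard_normal((N, 1))               # input projection
C = rng.standard_normal((1, N))               # output projection
delta = 0.1                                   # timescale parameter Delta

# Zero-order-hold discretization, Equation (2).
A_bar = expm(delta * A)
B_bar = np.linalg.inv(delta * A) @ (A_bar - np.eye(N)) @ (delta * B)

x = rng.standard_normal(L)                    # one-dimensional input sequence

# Recurrent form, Equation (3).
h = np.zeros((N, 1))
y_rec = np.zeros(L)
for t in range(L):
    h = A_bar @ h + B_bar * x[t]
    y_rec[t] = (C @ h).item()

# Convolutional form, Equation (4): K_bar[k] = C A_bar^k B_bar.
K_bar = np.array([(C @ np.linalg.matrix_power(A_bar, k) @ B_bar).item() for k in range(L)])
y_conv = np.array([np.dot(K_bar[: t + 1][::-1], x[: t + 1]) for t in range(L)])

assert np.allclose(y_rec, y_conv)             # both forms yield the same output
```

In practice, deep SSM layers typically use diagonal or otherwise structured $A$ matrices, so that the discretized parameters and the kernel can be computed far more cheaply than with the dense matrix operations in this sketch.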

2.1.2. Architectures

SSMs usually serve as independent sequence transformations that can be integrated into end-to-end neural network architectures. Here, we introduce several fundamental architectures. Linear attention [20] approximates self-attention with a recurrence mechanism as a simplified form of linear SSM. H3 [15], as illustrated in Figure 2, places an SSM between two gated connections and inserts a standard local convolution before it. Following H3, Hyena [21] substitutes an MLP-parameterized global convolution [22] for the SSM layer. RetNet [19] introduces an extra gate and employs a simpler SSM. RetNet uses a variant of multi-head attention (MHA) in place of convolutions, providing an alternative parallelizable computing approach. Inspired by the attention-free transformer [23], the recent RNN design RWKV [24] can be interpreted as the ratio of two SSMs, owing to its primary “WKV” mechanism involving linear time-invariant (LTI) recurrences.

2.1.3. Selective SSM

Traditional SSMs demonstrate linear time complexity, but their ability to represent sequence context is inherently limited by their time-invariant parameterization. To overcome these constraints, selective SSMs introduce a selective scan for interactions among sequential states, as shown below:

$$ B = S_B(x), \qquad C = S_C(x), \qquad \Delta = \tau_\Delta\left(\Delta + S_\Delta(x)\right). \quad (5) $$

This occurs before Equations (2) and (3), so that the parameters $B \in \mathbb{R}^{B \times L \times N}$, $C \in \mathbb{R}^{B \times L \times N}$, and $\Delta \in \mathbb{R}^{B \times L \times D}$ depend on the input sequence $x \in \mathbb{R}^{B \times L \times D}$, where $B$ represents the batch size and $D$ represents the number of channels. Normally, $S_B$ and $S_C$ are linear parameterized projections to dimension $N$, i.e., $\mathrm{Linear}_N(\cdot)$, while $S_\Delta(x) = \mathrm{Broadcast}_D(\mathrm{Linear}_1(x))$ and $\tau_\Delta = \mathrm{softplus}$. The choice of $S_\Delta$ and $\tau_\Delta$ results from their relationship with RNN gating mechanisms, which is explained later.
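As an illustration of the selective parameterization in Equation (5), the sketch below (our own PyTorch example with hypothetical module and dimension names, not the reference implementation) produces the input-dependent $B$, $C$, and $\Delta$ from simple linear projections followed by a softplus.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveParams(nn.Module):
    """Produce input-dependent SSM parameters B, C, Delta as in Equation (5).

    A minimal sketch: names and shapes follow the survey's notation
    (batch B, length L, channels D, state size N), not any specific codebase.
    """

    def __init__(self, d_model: int, d_state: int):
        super().__init__()
        self.s_B = nn.Linear(d_model, d_state)   # S_B: project x to dimension N
        self.s_C = nn.Linear(d_model, d_state)   # S_C: project x to dimension N
        self.s_Delta = nn.Linear(d_model, 1)     # S_Delta: Linear_1, broadcast over D
        self.delta_bias = nn.Parameter(torch.zeros(d_model))  # learnable base Delta

    def forward(self, x: torch.Tensor):
        # x: (B, L, D)
        B_param = self.s_B(x)                                   # (B, L, N)
        C_param = self.s_C(x)                                   # (B, L, N)
        delta = F.softplus(self.delta_bias + self.s_Delta(x))   # tau_Delta = softplus, broadcast to (B, L, D)
        return B_param, C_param, delta

params = SelectiveParams(d_model=64, d_state=16)
x = torch.randn(2, 128, 64)                        # (batch, length, channels)
B_param, C_param, delta = params(x)
print(B_param.shape, C_param.shape, delta.shape)   # (2, 128, 16) (2, 128, 16) (2, 128, 64)
```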

2.2. Other Key Concepts in Mamba

2.2.1. Selection Mechanism

There is a well-established link between discretizing continuous-time systems and RNN gating [25]. One example of the selection mechanism for SSMs is the traditional gating mechanism of RNNs. When $N = 1$, $A = -1$, $B = 1$, $S_\Delta = \mathrm{Linear}(x)$, and $\tau_\Delta = \mathrm{softplus}$, the selective SSM recurrence takes the following form:

$$ g_t = \sigma\left(\mathrm{Linear}(x_t)\right), \qquad h_t = (1 - g_t)\, h_{t-1} + g_t\, x_t. \quad (6) $$
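The following sketch (ours, purely illustrative) implements this gated recurrence directly, making the correspondence between the scalar selective-SSM special case of Equation (6) and a conventional RNN gate explicit; a real selective SSM uses the full $N$-dimensional state and a hardware-aware scan rather than a Python loop.

```python
import torch
import torch.nn as nn

class GatedRecurrence(nn.Module):
    """Scalar selective-SSM special case of Equation (6): h_t = (1 - g_t) h_{t-1} + g_t x_t."""

    def __init__(self, d_model: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_model)  # Linear(x) feeding the sigmoid gate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, length, d_model)
        g = torch.sigmoid(self.gate_proj(x))           # g_t = sigma(Linear(x_t))
        h = torch.zeros_like(x[:, 0])                  # h_0 = 0
        outputs = []
        for t in range(x.shape[1]):
            h = (1 - g[:, t]) * h + g[:, t] * x[:, t]  # gated interpolation between memory and input
            outputs.append(h)
        return torch.stack(outputs, dim=1)

y = GatedRecurrence(d_model=32)(torch.randn(2, 10, 32))
print(y.shape)  # torch.Size([2, 10, 32])
```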

2.2.2. Scan

The selection mechanism was devised to address the constraints of linear time invariance (LTI) models. However, it reintroduces the computation issue associated with SSMs. To enhance GPU utilization and efficiently materialize the state h within the memory hierarchy, hardware-aware state expansion is enabled by selective scan. By incorporating kernel fusion and recomputation with parallel scan, the fused selective scan layer can effectively decrease the quantity of memory I/O operations, leading to a significant acceleration compared to conventional implementations.
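The parallelizability rests on the fact that composing recurrence steps of the form $h_t = \bar{A}_t h_{t-1} + \bar{B}_t x_t$ is associative. The sketch below is our simplified scalar illustration of the combine rule that a Blelloch-style parallel scan would apply in logarithmic depth; the actual selective scan additionally relies on kernel fusion and recomputation on the GPU, which are not shown here.

```python
import numpy as np

def combine(left, right):
    """Associative operator for the linear recurrence h_t = a_t * h_{t-1} + b_t.

    Composing step (a1, b1) followed by (a2, b2) yields (a2 * a1, a2 * b1 + b2),
    which is exactly what a parallel prefix scan applies pairwise in log-depth.
    """
    a1, b1 = left
    a2, b2 = right
    return a2 * a1, a2 * b1 + b2

rng = np.random.default_rng(0)
L = 8
a = rng.uniform(0.5, 0.9, L)     # per-step decay (input-dependent in a selective SSM)
b = rng.standard_normal(L)       # per-step input contribution (B_bar_t * x_t)

# Sequential reference: h_t = a_t * h_{t-1} + b_t with h_{-1} = 0.
h, h_seq = 0.0, []
for t in range(L):
    h = a[t] * h + b[t]
    h_seq.append(h)

# Same result expressed as a prefix scan with the associative operator.
acc, h_scan = (1.0, 0.0), []     # (1, 0) is the identity element
for t in range(L):
    acc = combine(acc, (a[t], b[t]))
    h_scan.append(acc[1])

assert np.allclose(h_seq, h_scan)
```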

2.2.3. Discussion

Compared to RNNs and LSTMs, which struggle with vanishing gradients and long-range dependencies, Mamba provides efficient computation and memory utilization. While transformers excel in batch processing and handling long-range dependencies through attention mechanisms, they incur high computational costs, especially during inference. Mamba introduces a selective SSM, incorporating input-dependent matrices to enhance adaptability, while maintaining the computational advantages of traditional SSMs. Mamba bridges the gap between traditional SSMs and modern neural network architectures by providing a selective dependency mechanism, optimal GPU memory utilization, and linear scalability with context length, therefore offering a promising solution for various sequential data processing tasks.

3. Mamba for Vision

The original Mamba block was designed for one-dimensional sequences, yet vision-related tasks require processing multi-dimensional inputs like images, videos, and 3D representations. Consequently, to adapt Mamba for these tasks, enhancements to the scanning mechanism and architecture of the Mamba block play a vital role in effectively handling multi-dimensional inputs.
The current section presents efforts to enable Mamba to tackle vision-related tasks, while enhancing its efficiency and performance. Initially, we delve into two foundational works, including Vision Mamba [26] and VMamba [27]. These works introduced the Vision Mamba (ViM) block and visual state space (VSS) block, respectively, serving as a foundation for subsequent research endeavors. Subsequently, we explore additional works focused on refining the Mamba architecture as a backbone for vision-related tasks. Lastly, we discuss integrating Mamba with other architectures, including convolution, recurrence, and attention.

3.1. Visual Mamba Block

Drawing inspiration from the visual transformer architecture, it seems natural to preserve the framework of the transformer model, while substituting the attention block with a Mamba block and keeping the rest of the process intact. At the crux of the matter lies adapting the Mamba block to vision-related tasks. Nearly simultaneously, Vision Mamba and VMamba presented their respective solutions: the ViM block and the VSS block.

3.1.1. ViM

The ViM block [26], also known as the bidirectional Mamba block, annotates image sequences with position embeddings and condenses visual representations using a bidirectional SSM. It processes inputs both forward and backward, employing a separate one-dimensional convolution for each direction, as displayed in Figure 4a. The Softplus function ensures a non-negative $\Delta$. The forward and backward outputs $y$ are computed via the SSM described in Equations (2) and (3) and then combined through SiLU gating to produce the output token sequence, as shown in Figure 5a.
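A heavily simplified sketch of this bidirectional structure is shown below (our illustration: the selective SSM is replaced by placeholder linear layers, and position embeddings and the class token are omitted), highlighting the per-direction one-dimensional convolutions and the SiLU-gated merge.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiDirectionalBlock(nn.Module):
    """Simplified ViM-style block: per-direction 1D convolution + SSM, SiLU-gated merge.

    `ssm_fwd`/`ssm_bwd` stand in for the selective SSM of Equations (2)-(5);
    plain linear layers are used here purely to keep the sketch runnable.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.in_proj = nn.Linear(d_model, 2 * d_model)            # token branch + gate branch
        self.conv_fwd = nn.Conv1d(d_model, d_model, 3, padding=1, groups=d_model)
        self.conv_bwd = nn.Conv1d(d_model, d_model, 3, padding=1, groups=d_model)
        self.ssm_fwd = nn.Linear(d_model, d_model)                # stand-in for the forward selective SSM
        self.ssm_bwd = nn.Linear(d_model, d_model)                # stand-in for the backward selective SSM
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, length, d_model)
        x, z = self.in_proj(self.norm(tokens)).chunk(2, dim=-1)
        xc = x.transpose(1, 2)                                    # (batch, d_model, length) for Conv1d
        y_fwd = self.ssm_fwd(F.silu(self.conv_fwd(xc)).transpose(1, 2))
        y_bwd = self.ssm_bwd(F.silu(self.conv_bwd(xc.flip(-1))).transpose(1, 2)).flip(1)
        y = (y_fwd + y_bwd) * F.silu(z)                           # combine both directions under SiLU gating
        return tokens + self.out_proj(y)                          # residual connection

out = BiDirectionalBlock(d_model=64)(torch.randn(2, 196, 64))
print(out.shape)  # torch.Size([2, 196, 64])
```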

3.1.2. VSS

The VSS block [27] incorporates the pivotal SSM operation. It begins by directing the input through a depth-wise convolution layer, followed by a SiLU activation function, and then through the SSM outlined in Equations (2) and (3) employing an approximate $\bar{B}$. Afterward, the output of the SSM is subjected to layer normalization before being merged with the output of the other information stream, as in Figure 5b. To tackle the direction-sensitive problem, the authors presented a cross-scan module (CSM), which traverses the spatial domain and transforms any non-causal visual image into ordered patch sequences, as shown in Figure 4b. They refined the approximation of $\bar{B}$ using the first-order Taylor series: $\bar{B} = (\Delta A)^{-1}(\exp(\Delta A) - I) \cdot \Delta B \approx (\Delta A)(\Delta A)^{-1} \Delta B = \Delta B$.
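To illustrate the cross-scan idea (our sketch, not the VMamba implementation), the flattened patch grid can be expanded into four directional sequences, each of which would be fed to an SSM, with the four outputs then merged back onto the grid.

```python
import torch

def cross_scan(x: torch.Tensor) -> torch.Tensor:
    """Expand a patch grid into four directional scan sequences (cross-scan sketch).

    x: (batch, H, W, C) -> (batch, 4, H*W, C)
    Directions: row-major, column-major, and their two reverses.
    """
    b, h, w, c = x.shape
    row = x.reshape(b, h * w, c)                          # left-to-right, top-to-bottom
    col = x.transpose(1, 2).reshape(b, h * w, c)          # top-to-bottom, left-to-right
    return torch.stack([row, col, row.flip(1), col.flip(1)], dim=1)

def cross_merge(seqs: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """Invert the four scans and sum them back onto the grid: (batch, 4, H*W, C) -> (batch, H, W, C)."""
    b, _, _, c = seqs.shape
    row = seqs[:, 0] + seqs[:, 2].flip(1)                 # undo the reversed row scan
    col = (seqs[:, 1] + seqs[:, 3].flip(1)).reshape(b, w, h, c).transpose(1, 2).reshape(b, h * w, c)
    return (row + col).reshape(b, h, w, c)

x = torch.randn(2, 4, 4, 8)
seqs = cross_scan(x)                                      # each sequence would be processed by an SSM
assert torch.allclose(cross_merge(seqs, 4, 4), 4 * x)     # with identity SSMs, merging recovers 4x the input
```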

3.2. Pure Mamba

It is clear from Figure 5 what the primary distinction is between ViM and VSS blocks: the ViM block employs separate one-dimensional convolutions for scanning different directions. Conversely, in VSS blocks, different scanning schemes share the same depth-wise convolution layer. Thus, for pure Mamba architectures, we consider designs that use individual one-dimensional convolutions for each scanning direction as an extension of ViM-based approaches, while those sharing a depth-wise convolution layer among different scanning schemes are seen as extensions of VSS-based approaches. In addition to these two approaches, researchers also consider visual data as multi-dimensional data, where the pure Mamba architecture typically relies on the original Mamba block. Therefore, in this subsection, we will introduce the pure Mamba architecture derived from these three branches, followed by a summary of 2D scanning mechanisms utilized in visual Mamba.

3.2.1. ViM-Based

Inspired by the vision transformer architecture, Vision Mamba [26] replaces the transformer encoder with a vision Mamba encoder based on ViM blocks, while retaining the remainder of the process. This involves converting the two-dimensional image into flattened patches, followed by linear projection of these patches into vectors and the addition of position embeddings. A class token represents the entire patch sequence, and subsequent steps involve normalization layers and an MLP layer to derive the final predictions.
LocalMamba [29] is built on the ViM block and introduces a scanning approach that combines global context with localized scanning within distinct windows to capture comprehensive local information. In addition, LocalMamba searches scanning directions across various network layers to identify and utilize the most effective scanning combinations. The authors proposed two variants, with plain and hierarchical structures, respectively. They also proposed the LocalVim block, which includes four scanning directions (cf. Figure 4d): ViM scanning and scanning over tokens partitioned into distinct windows, together with their flipped equivalents to facilitate scanning from tail to head. Additionally, the block incorporates a state space module and a spatial and channel attention module (SCAttn).

3.2.2. VSS-Based

VMamba [27] proceeds through four stages after partitioning the input image into patches, as in Vision Mamba. In Stage 1, VMamba stacks several VSS blocks on the feature map at a resolution of $\frac{H}{4} \times \frac{W}{4}$. In Stage 2, before more VSS blocks are involved, the Stage 1 feature map undergoes a patch merging operation for downsampling, in order to build hierarchical representations, resulting in an output resolution of $\frac{H}{8} \times \frac{W}{8}$. Stage 3 and Stage 4 repeat this pattern at resolutions of $\frac{H}{16} \times \frac{W}{16}$ and $\frac{H}{32} \times \frac{W}{32}$, respectively.
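The patch merging between stages is commonly implemented in the Swin style, which hierarchical visual Mamba models typically reuse; the sketch below is our illustration of that operation rather than VMamba's exact code.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Swin-style patch merging used for hierarchical downsampling between stages.

    Sketch of the common implementation: concatenate each 2x2 neighborhood of
    patches (4C channels), normalize, and project to 2C, halving H and W.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, H, W, C) with H and W even
        x = torch.cat(
            [x[:, 0::2, 0::2], x[:, 1::2, 0::2], x[:, 0::2, 1::2], x[:, 1::2, 1::2]],
            dim=-1,
        )                                    # (batch, H/2, W/2, 4C)
        return self.reduction(self.norm(x))  # (batch, H/2, W/2, 2C)

x = torch.randn(1, 56, 56, 96)               # Stage-1 feature map at H/4 x W/4
print(PatchMerging(96)(x).shape)             # torch.Size([1, 28, 28, 192]) -> H/8 x W/8
```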
The PlainMamba block [28], which is based on the VSS block, uses the following two primary mechanisms to improve its capacity to learn features from two-dimensional images: (i) using a continuous 2D scanning procedure to increase spatial continuity, ensuring tokens in the scanning sequence are adjacent, as depicted in Figure 4c; and (ii) integrating direction-aware updating to encode directional information, allowing the model to recognize spatial relationships between tokens. PlainMamba can address the issue of spatial discontinuity in the 2D scanning mechanisms of Vim and VMamba. It continues to scan in the opposite direction as it moves to a new row or column until it reaches the image’s final vision token. Moreover, PlainMamba does away with the requirement for special tokens.
Within lightweight model designs, EfficientVMamba [30] improves the capabilities of VMamba with an atrous-based selective scan approach, i.e., efficient 2D scanning (ES2D). Instead of scanning all patches from various directions and increasing the total number of patches, ES2D adopts a strategy of scanning forward vertically and horizontally, while skipping patches and maintaining the number of patches unchanged, as shown in Figure 4e. Their efficient visual state space (EVSS) block comprises a convolutional branch for local features, applies ES2D as the SSM branch for global features, and all branches end with a squeeze-excitation block. They employ EVSS blocks for both Stage 1 and Stage 2, while opting for inverted residual blocks in Stage 3 and Stage 4, to enhance the capture of global representations.

3.2.3. Visual Data as Multi-Dimensional Data

Existing models for multi-dimensional data also work for vision-related tasks but often lack the capacity to facilitate inter- and intra-dimensional communication or data independence. The MambaMixer block [37] incorporates a dual selection mechanism that operates across tokens and channels. By linking sequential selective mixers via a weighted averaging mechanism, it enables layers to directly access inputs and outputs from different layers. Mamba-ND [38] expands the application of the SSM to higher dimensions by alternating the sequence ordering across layers. Utilizing a scanning strategy similar to VMamba in the 2D scenario, it extends this approach to 3D. Additionally, the authors advocated for the use of multi-head SSMs as an analog to multi-head attention. In response to the inefficiencies and performance challenges encountered by traditional transformers in image and time series processing, a new simplified Mamba-based architecture, SiMBA [39], was proposed, incorporating the Mamba block for sequence modeling and Einstein FFT (EinFFT) for channel modeling to improve stability and effectiveness when handling image and time series tasks. The Mamba block proves effective at processing long sequence data, while EinFFT represents a novel channel modeling technique. Experimental results demonstrated that SiMBA surpassed existing SSMs and transformers across multiple benchmark tests.

3.2.4. Summary of 2D Scanning Mechanisms

Scanning serves as a key component of Mamba, since for multi-dimensional inputs the choice of scanning mechanism matters. As shown in Figure 4, we summarize the existing 2D scanning mechanisms. In particular, direction-aware updating employs a set of learnable parameters $\{\Theta_k\}$ to represent both the four cardinal directions and a special begin direction for the initial token, reformulating Equation (3) as follows:

$$ h_k(t) = \bar{A}_t\, h_k(t-1) + \left(\bar{B}_t + \bar{\Theta}_{k,t}\right) x(t), \qquad y(t) = \sum_{k=1}^{4} C_t\, h_k(t), \qquad y(t) = y(t) \odot z(t), \quad (7) $$

where $\odot$ denotes element-wise multiplication and $z(t)$ is a gating mechanism that modulates the output. Expanding on the fundamental structure of Mamba in Equation (7), we can devise the additional scanning mechanisms depicted in Figure 4.
As a vital element of Mamba, scanning mechanisms not only affect efficiency but also determine how contextual information is gathered in visual tasks. In this study, we summarize the usage of different scanning mechanisms in existing works in Table 1. Cross-scan [27] and bidirectional scan [26] stand out as the most widely adopted scanning mechanisms. Nevertheless, various other scanning mechanisms serve specific purposes. For example, the 3D bidirectional scan [33] and spatiotemporal selective scan [35] are tailored for video inputs. The local scan [29] focuses on gathering local information, while ES2D [30] prioritizes efficiency.
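As a concrete example of how such mechanisms reduce to re-indexing the token sequence, the sketch below (our illustration of the local-scan idea, not LocalMamba's code) builds a permutation that visits each window of the patch grid contiguously.

```python
import torch

def local_scan_order(h: int, w: int, win: int) -> torch.Tensor:
    """Return a permutation of the H*W token indices that visits each win x win window contiguously.

    Illustrative sketch of the local-scan idea: tokens inside a window stay adjacent
    in the 1D sequence, so an SSM scan aggregates local context before moving on.
    """
    idx = torch.arange(h * w).reshape(h, w)
    order = []
    for wi in range(0, h, win):
        for wj in range(0, w, win):
            order.append(idx[wi:wi + win, wj:wj + win].reshape(-1))
    return torch.cat(order)

order = local_scan_order(h=4, w=4, win=2)
print(order.tolist())
# [0, 1, 4, 5, 2, 3, 6, 7, 8, 9, 12, 13, 10, 11, 14, 15]
# tokens = tokens[:, order] reorders a (batch, H*W, C) sequence; order.argsort() restores it.
```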

3.3. Mamba with Other Architectures

Mamba, being a novel component compared to convolution, recurrence, and attention, offers opportunities for synergistic combinations with other architectures that are still relatively underexplored. The combination of Mamba with other architectures typically occurs through two approaches: (i) designing blocks that integrate the concepts of other architectures alongside Mamba blocks, or (ii) incorporating Mamba blocks into existing architectures. In this section, we examine existing exploratory findings on such combinations. We begin by presenting the first type of combination, followed by summarizing the second type in Table 2.
Table 1. Summary of the scanning mechanisms used in visual Mamba.

Scanning Mechanism | Methods
BiDirectional Scan [26] | Vision Mamba [26], Motion Mamba [34], HARMamba [40], MMA [41], VL-Mamba [42], Video Mamba Suite [43], Point Mamba [44], LMa-UNet [45], Motion-Guided Dual-Camera Tracker [46]
Cross-Scan [27] | VMamba [27], VL-Mamba [42], VMRNN [47], RES-VMAMBA [48], Sigma [49], ReMamber [50], Mamba-UNet [51], Semi-Mamba-UNet [52], VMambaMorph [53], ChangeMamba [54], H-vmunet [55], MambaMIR [56], MambaIR [57], Serpent [58], Mamba-HUNet [59], TM-UNet [60], Swin-UMamba [61], UltraLight VM-UNet [62], VM-UNet [63], VM-UNET-V2 [64], MedMamba [65], MIM-ISTD [66], RS3Mamba [67]
Continuous 2D Scanning [28] | PlainMamba [28]
Local Scan [29] | LocalMamba [29], FreqMamba [68]
Efficient 2D Scanning (ES2D) [30] | EfficientVMamba [30]
Zigzag Scan [31] | ZigMa [31]
Omnidirectional Selective Scan [32] | VmambaIR [32], RS-Mamba [69]
3D BiDirectional Scan [33] | VideoMamba [33]
Hierarchical Scan [34] | Motion Mamba [34]
Spatiotemporal Selective Scan [35] | Vivim [35]
Multi-Path Scan [36] | RSMamba [36]

3.3.1. Mamba with Convolution

Convolution, being a widely employed technique, possesses the advantageous property of capturing local information. Consequently, it is frequently integrated with Mamba to augment its capabilities. To construct a new fundamental block, the identity branch of the residual block and convolution layers are commonly employed to enhance the Mamba block. This integration aims to improve the representational capability of the model and its effectiveness in tasks requiring a thorough comprehension of visual data, by combining the localized details captured by convolution with the overarching global features captured by Mamba blocks. By incorporating a residual learning framework into the VMamba model, RES-VMAMBA [48] was the first to utilize both the local and global state features inherent in the original VMamba architectural design. MambaIR [57] introduced the residual state space (RSS) block, incorporating scale residual connections, a convolutional layer, and channel attention atop the VSS block. LMa-UNet [45] incorporates residual connections with ViM at both the pixel level and patch level. nnMamba [70] introduced the Res-Mamba block, which merges the Mamba block with a convolution layer, batch normalization, ReLU activation, residual connections, and weight sharing among channels and spatial dimensions using a Siamese input. SegMamba [71] introduced the TSMamba block, which enhances tri-orientated Mamba with layer normalization, gated spatial convolutional layers, and residual connections. MambaMIR [56] introduced the AMSS block group, which enhances feature extraction for reconstruction and uncertainty estimation by incorporating a convolutional layer and layer normalization ahead of a VSS-based Mamba block called AMSS. MedMamba [65] introduced the SS-Conv-SSM block, which comprises a convolution branch and a VSS branch.
Table 2. Summary of visual Mamba with other architectures.

Mamba Method | Capability
Convolution:
RES-VMAMBA [48] | Food vision tasks
MedMamba [65] | Medical images classification tasks
HSIMamba [72] | Hyperspectral images classification tasks
MambaMIR [56], MambaMIR-GAN [56] | Medical images reconstruction tasks
MambaIR [57] | Image restoration tasks
VMambaMorph [53] | 3D images registration tasks
FreqMamba [68] | Image deraining tasks
Pan-Mamba [73] | Pan-sharpening tasks
MambaTalk [74] | Gesture synthesis tasks
Samba [75] | Images semantic segmentation tasks
Semi-Mamba-UNet [52], Swin-UMamba [61], H-vmunet [55], UltraLight VM-UNet [62], Weak-Mamba-UNet [76], LMa-UNet [45], SegMamba [71], T-Mamba [77], Vivim [35], nnMamba [70], ProMamba [78] | Medical images segmentation tasks
Recurrence:
VMRNN [47] | Video prediction tasks
VMambaMorph [53] | 3D images registration tasks
Attention:
SSM-ViT [79] | Event camera-based tasks
MMA [41] | Image super-resolution tasks
ViS4mer [80] | Long movie clip classification tasks
FDVM-Net [81] | Images exposure correction tasks
CMViM [82] | 3D multi-modal representation tasks
Motion-Guided Dual-Camera Tracker [46] | Endoscopy skill evaluation tasks
MambaIR [57] | Image restoration tasks
FreqMamba [68] | Image deraining tasks
3DMambaComplete [83] | Point cloud completion tasks
VM-UNET-V2 [64], Weak-Mamba-UNet [76], UltraLight VM-UNet [62], ProMamba [78] | Medical images segmentation tasks
U-Net:
U-Mamba [84], UVM-Net [85], Mamba-UNet [51], TM-UNet [60], Semi-Mamba-UNet [52], Swin-UMamba [61], Weak-Mamba-UNet [76], LMa-UNet [45], LightM-UNet [86], UltraLight VM-UNet [62], VM-UNET-V2 [64], H-vmunet [55], Mamba-HUNet [59], VM-UNet [63] | Medical images tasks
MambaMIR-GAN [56] | Medical images reconstruction tasks
VmambaIR [32] | Image restoration tasks
Motion Mamba [34] | Generation tasks
MambaMorph [87] | Multi-modality registration tasks
FreqMamba [68] | Image deraining tasks
RS-Mamba [69] | Dense image prediction tasks
Diffusion:
DiS [88], ZigMa [31], Motion Mamba [34], SSM-based diffusion model [89] | Generation tasks
MD-Dose [90] | Radiation dose prediction tasks
In particular, specialized Mamba-based blocks have been devised to capture frequency-based information, leveraging techniques from convolutional networks. T-Mamba [77] introduced the Tim block, which integrates frequency-based bandpass filtering atop convolutional shared dual position encoding compensation and a gate selection unit. HSIMamba [72] introduced the HyperspectralBiNetworks block, which was derived from ViM but tailored for spectral inputs. Vivim [35] introduced the Temporal Mamba Block, integrating a spatiotemporal version of ViM named ST-Mamba, along with detail-specific FFN, convolution, and layer normalization. FreqMamba [68] introduced the FreqSSM block, incorporating a convolutional layer for the spatial branch, discrete wavelet transformation with the SSM block for the frequency band branch, and a Fourier modeling branch implemented with a convolutional layer.
A branch of research directly integrates Mamba blocks with convolutional layers into existing architectures. VMambaMorph [53] utilizes a hybrid approach, employing half of a 3D VSS block and half of a 3D CNN to construct a U-shaped network to serve as the registration module. Pan-Mamba [73] and UltraLight VM-UNet [62] incorporate convolutional layers at the start and end of their architecture to enhance feature quality. RS3Mamba [67] employs VSS blocks to construct an auxiliary encoder and utilizes convolution-based CCM modules and residual blocks as the main encoder for semantic segmentation of remote sensing images. Samba [36] utilizes convolution as a stem at the beginning of the architecture. H-vmunet [55] integrates high-order VSS blocks into a CNN-based U-shaped network for medical image segmentation tasks. Swin-UMamba [61] substitutes attention with VSS blocks in a Swin model [91], incorporating hierarchical Mamba with shifted windows. Additionally, it integrates a CNN-based U-shaped network architecture. Semi-Mamba-UNet [52] combines Mamba-based and CNN-based U-shaped branches to segment medical images in a semi-supervised manner. Furthermore, Weak-Mamba-UNet [76] enhances performance by incorporating an additional ViT-based U-shaped branch. MambaTalk [74] employs two convolutional-based audio feature extraction networks along with a Mamba model for motion synthesis. ProMamba [78] consists of a ViM-based image encoder, a Transformer-based prompt encoder, and a CNN-based mask decoder.

3.3.2. Mamba with Recurrence

To harness the long-sequence modeling capabilities of Mamba blocks and the spatiotemporal representation prowess of LSTMs, the VMRNN [47] cell eliminates all weights and biases in ConvLSTM [92] and employs VSS blocks to learn spatial dependencies vertically. Long-term and short-term temporal dependencies are captured in the VMRNN cell by updating the information in cell states and hidden states from a horizontal perspective. Building upon the VMRNN cell, two variants have been proposed: VMRNN-B and VMRNN-D. VMRNN-B mainly focuses on stacking VMRNN layers, while VMRNN-D incorporates more VMRNN cells and introduces patch merging and patch expanding layers. By downsampling the data and lowering its spatial dimensions, the patch merging layer helps to capture more abstract, global features, while also lowering the computational complexity. In contrast, upsampling employs a patch-expanding layer to increase the spatial dimensions, in order to recover detail and facilitate accurate feature localization during the reconstruction stage. Ultimately, the reconstruction layer creates the predicted frame for the subsequent time step by scaling the hidden state from the VMRNN layer back to the input size. Integrating downsampling and upsampling processes offers important benefits for a predictive architecture. By making the input representation simpler through downsampling, the model can process higher-level features at minimal computational cost. This is especially helpful for grasping the intricate linkages and patterns in the data more abstractly. In addition, in VMambaMorph [53], a recursive registration framework integrates a hybrid VSS- and CNN-based VMambaMorph as the registration module.

3.3.3. Mamba with Attention

Attention mechanisms, like self-attention and cross-attention, empower transformers to concentrate on pertinent segments of the input sequence. This attention-driven strategy boosts the model’s capacity to assess the significance of various elements, resulting in more intricate and contextually informed representations. Cross-attention stands out as the most prevalent element for integrating with Mamba to facilitate information exchange. FD-Vision Mamba [81] introduced the C-SSM block, which merges the SSM block with cross-attention to facilitate information exchange between the amplitude and phase branches. SpikeMba [93] integrated the SNN block with simplified cross-attention in a spiking saliency detector to enable information exchange between text features and relevant slots. Subsequently, it employs multi-modal relevant Mamba blocks to bolster long-range dependency. MambaIR [57] uses channel attention as part of its RSS block. The meet more areas (MMA) block introduced in [41] adopts a MetaFormer-style architecture, comprising two layer normalization layers, a token mixer (consisting of a channel attention mechanism and a ViM block in parallel), and an MLP block for deep feature extraction. Instead of using cross-attention, the SSM-ViT block [79] comprises three main components: a self-attention block (Block-SA), a dilated attention block (Grid-SA), and an SSM block. The block-SA focuses on immediate spatial relations and provides a detailed representation of nearby features. Grid-SA offers a global perspective, capturing comprehensive spatial relations and overall input structure. The SSM block ensures temporal consistency and a smooth information transition between consecutive time steps. By integrating SSMs with self-attention, the SSM-ViT block enables faster training and parameter timescale adjustment for temporal aggregation.
Few works have leveraged attention at the architectural level to enhance Mamba’s performance. In ViS4mer [80], self-attention is employed to process each frame and obtain features for the SSM-based multi-scale decoder. Following the extraction of short-range spatiotemporal features by a standard transformer encoder, long-term temporal reasoning is captured by the Mamba-based multi-scale temporal S4 decoder. Thus, ViS4mer achieves decent performance in understanding long videos. CMViM [82] incorporates a single cross-attention layer after the online ViM encoder to facilitate information exchange between the MRI and PET branches. 3DMambaComplete [83] integrates attention blocks into the HyperPoint generation process to enhance features extracted from incomplete point clouds and FPS. FreqMamba [68] exploits the distinctive data-dependent characteristic of Mamba alongside attention to identify potential degradation locations at different granular levels. ProMamba [78] employs self-attention and prompt-to-image attention mechanisms within the prompt encoder. UltraLight VM-UNet [62] employs spatial and channel attention mechanisms to facilitate weight sharing. Conversely, VM-UNET-V2 [64] introduces an SDI block, which computes attention scores for both spatial and channel dimensions. This block is positioned between the encoder, composed of VSS blocks, and the decoder, consisting of fusion blocks. In the motion-guided dual-camera tracker [46], two crucial elements are employed: a cross-camera mutual template strategy (CMT) and a Mamba-based motion-guided prediction head (MMH). Inspired by cross-attention, CMT aggregates features from dual cameras, while MMH utilizes a ViM block to capture motion tokens. The integration of vision and motion is facilitated by a cross-attention module.

3.3.4. Others

The U-shape net and diffusion architectures serve as fundamental frameworks frequently combined with Mamba blocks, particularly in the medical field. Given their prevalence, we believe that it is important to highlight them, so we have compiled related works in Table 2.

3.4. Comparison of Mamba Models and Other State-of-the-Art Models

In this section, we thoroughly summarize the performance of various visual Mamba backbone networks on standard benchmarks and conduct an in-depth comparison with the performance of selected CNN and Transformer backbones on the same datasets, as shown in Table 3, Table 4, Table 5 and Table 6. Our analysis focuses on three public datasets: ImageNet-1K for classification, COCO for object detection and instance segmentation, and ADE20K for semantic segmentation. By analyzing the performance and computational complexity of models with different architectures, we can gain insights into each model’s advantages, guiding model selection and providing valuable references for future research and practical applications in computer vision.
According to these tables, we clearly observe that, compared to CNN and Transformer models, Mamba either achieved better performance or required fewer computational resources. In Table 3, the best performance achieved by a CNN was 81.7% with 39M parameters, and by a Transformer 84.5% with 88M parameters, while Mamba achieved 84.7% with 40M parameters. Similar superior performance by Mamba is observed in Table 4 and Table 6. An exception is noted in Table 5, where a Transformer achieved the best performance of 51.9% with 145M parameters and 982G FLOPs, whereas Mamba, with a maximum of 69M parameters, achieved a 49.9% average precision (AP).

3.4.1. Analysis and Comparison in Image Classification Tasks

As shown in Table 3, the ImageNet-1K dataset is designed for image classification tasks, where accuracy is the key measure of performance. Therefore, we used the Top-1 accuracy metric to compare the classification capabilities of different models. The comparison shows that CNN models have moderate parameter counts and computational complexity but relatively low Top-1 accuracy, mostly below 80%. The highest Top-1 accuracy among CNN models was 81.7%, for RegNetY-8G [94]. In contrast, most Transformer models have more parameters and greater computational complexity. For example, the ViT-L/16 [95] model has 307M parameters and 190.7G FLOPs, making it suitable for high-performance computing environments. In such environments, Transformer models generally achieve higher Top-1 accuracy, such as 83.7% for ViL-Base-RPB [96] and 83.8% for Focal-Base [97]. Mamba models span a wide range of parameter counts and computational complexities, accommodating various application needs. They include lightweight models such as EfficientVMamba-T [30] and more complex models such as VMamba-B [27]. Many Mamba models achieved a Top-1 accuracy above 83%, such as LocalVMamba-S [29] and SiMBA-B [39].
From Table 3, it is evident that CNN models do not perform as well in terms of accuracy compared to Transformer and Mamba models. Focusing on the Transformer and Mamba models: among the Transformer models, Swin-B [91] performed the best, with a Top-1 accuracy of 84.5%; of the Mamba models, SiMBA-B (MLP) [39] performed the best, with a Top-1 accuracy of 84.7%. Notably, Swin-B used an image size of 384², whereas SiMBA-B (MLP) used an image size of 224². Swin-B can therefore exploit more detailed information, but SiMBA-B (MLP) still outperformed it. Additionally, because Swin-B uses a larger image size, its parameter count and computational complexity are significantly higher than those of SiMBA-B (MLP).
Based on the above analysis, we can summarize the advantages and disadvantages of these three types of models in image classification tasks. CNN models have relatively low parameter counts and computational complexity, making them suitable for environments with limited resources, but they perform worse in terms of accuracy. Transformer models achieve relatively high accuracy but have more parameters and higher computational complexity, requiring substantial resources and thus fitting high-performance computing environments. Mamba models are more diverse, fitting different computational resources and application requirements, with many models being competitive in accuracy. However, researchers must select the appropriate model based on the specific application scenario.

3.4.2. Analysis and Comparison in Object Detection and Instance Segmentation Tasks

Since the COCO dataset is used for object detection and instance segmentation tasks, it not only requires recognizing object categories but also accurately localizing them. Therefore, average precision (AP) metrics are needed to comprehensively evaluate a model’s performance at different intersection over union (IoU) thresholds. To fully measure the performance of models under different training configurations, we provide two tables, which show the performance of Mask R-CNN under the 1× schedule and the 3× MS schedule for object detection and instance segmentation tasks on the COCO dataset, as shown in Table 4 and Table 5.
Table 3. Comparison of Mamba models and other state-of-the-art models on the ImageNet-1K dataset.

Backbone | Image Size | Params (M) | FLOPs (G) | Top-1 ACC (%)
CNN:
ResNet-50 [98] | 224² | 25.5 | 4.1 | 76.50
ResNet-50-D [99] | 224² | 25.0 | 4.3 | 77.16
ResNet-101 [98] | 224² | 44.6 | 7.8 | 77.4
ResNet-152 [98] | 224² | 60.2 | 11.6 | 78.3
ResNeXt-50-32×4d [100] | 224² | 25 | 4.1 | 77.8
ResNeXt-101-32×4d [100] | 224² | 44 | 7.8 | 78.8
RegNetY-4G [94] | 224² | 21 | 4.0 | 80.0
RegNetY-8G [94] | 224² | 39 | 8.0 | 81.7
Transformer:
ViT-B/16 [95] | 384² | 86 | 55.4 | 77.9
ViT-L/16 [95] | 384² | 307 | 190.7 | 76.5
DeiT-S [101] | 224² | 22 | 4.6 | 79.8
DeiT-B [101] | 224² | 86 | 17.6 | 81.8
DeiT-B [101] | 384² | 86 | 55.4 | 83.1
Swin-T [91] | 224² | 29 | 4.5 | 81.3
Swin-S [91] | 224² | 50 | 8.7 | 83.0
Swin-B [91] | 224² | 88 | 15.4 | 83.5
Swin-B [91] | 384² | 88 | 47.0 | 84.5
ViL-Small-APE [96] | 224² | 24.6 | 4.9 | 82.0
ViL-Small-RPB [96] | 224² | 24.6 | 4.9 | 82.4
ViL-Medium-APE [96] | 224² | 39.7 | 8.7 | 83.3
ViL-Medium-RPB [96] | 224² | 39.7 | 8.7 | 83.5
ViL-Base-APE [96] | 224² | 55.7 | 13.4 | 83.2
ViL-Base-RPB [96] | 224² | 55.7 | 13.4 | 83.7
Focal-Tiny [97] | 224² | 29.1 | 4.9 | 82.2
Focal-Small [97] | 224² | 51.1 | 9.1 | 83.5
Focal-Base [97] | 224² | 89.8 | 16.0 | 83.8
Mamba:
Vim-Ti [26] | 224² | 7 | - | 76.1
Vim-S [26] | 224² | 26 | - | 80.5
VMamba-T [27] | 224² | 22 | 4.5 | 82.2
VMamba-S [27] | 224² | 44 | 9.1 | 83.5
VMamba-B [27] | 224² | 75 | 15.2 | 83.2
PlainMamba-L1 [28] | 224² | 7 | 3.0 | 77.9
PlainMamba-L2 [28] | 224² | 25 | 8.1 | 81.6
PlainMamba-L3 [28] | 224² | 50 | 14.4 | 82.3
LocalVim-T [29] | 224² | 8 | 1.5 | 76.2
LocalVim-S [29] | 224² | 28 | 4.8 | 81.2
LocalVMamba-T [29] | 224² | 26 | 5.7 | 82.7
LocalVMamba-S [29] | 224² | 50 | 11.4 | 83.7
EfficientVMamba-T [30] | 224² | 6 | 0.8 | 76.5
EfficientVMamba-S [30] | 224² | 11 | 1.3 | 78.7
EfficientVMamba-B [30] | 224² | 33 | 4.0 | 81.8
Mamba-2D-S [38] | 224² | 24 | - | 81.7
Mamba-2D-B [38] | 224² | 92 | - | 83.0
SiMBA-S (Monarch) [39] | 224² | 18.5 | 3.6 | 81.1
SiMBA-S (EinFFT) [39] | 224² | 15.3 | 2.4 | 81.7
SiMBA-S (MLP) [39] | 224² | 26.5 | 5.0 | 84.0
SiMBA-B (Monarch) [39] | 224² | 26.9 | 5.5 | 82.6
SiMBA-B (EinFFT) [39] | 224² | 22.8 | 4.2 | 83.0
SiMBA-B (MLP) [39] | 224² | 40.0 | 9.0 | 84.7
SiMBA-L (Monarch) [39] | 224² | 42 | 8.7 | 83.8
SiMBA-L (EinFFT) [39] | 224² | 36.6 | 7.6 | 83.9
As seen in Table 4, when using the Mask R-CNN 1 × schedule, the precision metrics of CNN models improved as the number of parameters and FLOPs increased. Compared to Transformer and Mamba models, CNN models had moderate parameters and FLOPs but relatively lower performance. Among the Transformer models, the ViT-Adapter-B [102] performed excellently but with relatively high parameters. Transformer models offer good performance but generally have higher parameters and FLOPs. On the other hand, Mamba models, while maintaining lower parameters and computational complexity, can deliver performance very close to or even surpassing some Transformer models, particularly with the LocalVMamba [29] and VMamba [27] backbones.
Table 4. Comparison of Mamba models and other state-of-the-art models on the COCO dataset (Mask R-CNN 1× schedule).

Backbone | AP^b | AP^b_50 | AP^b_75 | AP^m | AP^m_50 | AP^m_75 | Params (M) | FLOPs (G)
CNN:
ResNet-50 [98] | 38.2 | 58.8 | 41.4 | 34.7 | 55.7 | 37.2 | 44 | 260
ResNet-101 [98] | 38.2 | 58.8 | 41.4 | 34.7 | 55.7 | 37.2 | 63 | 336
ResNeXt101-32×4d [100] | 41.9 | - | - | 37.5 | - | - | 63 | 340
ResNeXt101-64×4d [100] | 42.8 | - | - | 38.4 | - | - | 102 | 493
Transformer:
ViT-Adapter-T [102] | 41.1 | 62.5 | 44.3 | 37.5 | 59.7 | 39.9 | 28.1 | -
ViT-Adapter-S [102] | 44.7 | 65.8 | 48.3 | 39.9 | 62.5 | 42.8 | 47.8 | -
ViT-Adapter-B [102] | 47.0 | 68.2 | 51.4 | 41.8 | 65.1 | 44.9 | 120.2 | -
Swin-Tiny [91] | 42.2 | - | - | 39.1 | - | - | 48 | 264
Swin-Small [91] | 44.8 | - | - | 40.9 | - | - | 69 | 354
PVT-Tiny [103] | 36.7 | 59.2 | 39.3 | 35.1 | 56.7 | 37.3 | 32.9 | -
PVT-Small [103] | 40.4 | 62.9 | 43.8 | 37.8 | 60.1 | 40.3 | 44.1 | -
PVT-Medium [103] | 42.0 | 64.4 | 45.6 | 39.0 | 61.6 | 42.1 | 63.9 | -
PVT-Large [103] | 42.9 | 65.0 | 46.6 | 39.5 | 61.9 | 42.5 | 81.0 | -
Mamba:
VMamba-T [27] | 46.5 | 68.5 | 50.7 | 42.1 | 65.5 | 45.3 | 42 | 262
VMamba-S [27] | 48.2 | 69.7 | 52.5 | 43.0 | 66.6 | 46.4 | 64 | 357
VMamba-B [27] | 48.5 | 69.6 | 53.0 | 43.1 | 67.0 | 46.4 | 96 | 482
PlainMamba-Adapter-L1 [28] | 44.1 | 64.8 | 47.9 | 39.1 | 61.6 | 41.9 | 31 | 388
PlainMamba-Adapter-L2 [28] | 46.0 | 66.9 | 50.1 | 40.6 | 63.8 | 43.6 | 53 | 542
PlainMamba-Adapter-L3 [28] | 46.8 | 68.0 | 51.1 | 41.2 | 64.7 | 43.9 | 79 | 696
EfficientVMamba-T [30] | 35.6 | 57.7 | 38.0 | 33.2 | 54.4 | 35.1 | 11 | 60
EfficientVMamba-S [30] | 39.3 | 61.8 | 42.6 | 36.7 | 58.9 | 39.2 | 31 | 197
EfficientVMamba-B [30] | 43.7 | 66.2 | 47.9 | 40.2 | 63.3 | 42.9 | 53 | 252
LocalVMamba-T [29] | 46.7 | 68.7 | 50.8 | 42.2 | 65.7 | 45.5 | 45 | 291
LocalVMamba-S [29] | 48.4 | 69.9 | 52.7 | 43.2 | 66.7 | 46.5 | 69 | 414
SiMBA-S [39] | 46.9 | 68.6 | 51.7 | 42.6 | 65.9 | 45.8 | 60 | 382
Table 5. Comparison of Mamba models and other state-of-the-art models on the COCO dataset (Mask R-CNN 3× MS schedule).

Backbone | AP^b | AP^b_50 | AP^b_75 | AP^m | AP^m_50 | AP^m_75 | Params (M) | FLOPs (G)
CNN:
ConvNeXt-T [104] | 46.2 | 67.9 | 50.8 | 41.7 | 65.0 | 44.9 | 48 | 262
Transformer:
Swin-T [91] | 50.5 | 69.3 | 54.9 | 43.7 | 66.6 | 47.1 | 86 | 745
Swin-S [91] | 51.8 | 70.4 | 56.3 | 44.7 | 67.9 | 48.5 | 107 | 838
Swin-B [91] | 51.9 | 70.9 | 56.5 | 45.0 | 68.4 | 48.7 | 145 | 982
ViT-Adapter-T [102] | 46.0 | 67.6 | 50.4 | 41.0 | 64.4 | 44.1 | 28.1 | -
ViT-Adapter-S [102] | 48.2 | 69.7 | 52.5 | 42.8 | 66.4 | 45.9 | 47.8 | -
ViT-Adapter-B [102] | 49.6 | 70.6 | 54.0 | 43.6 | 67.7 | 46.9 | 120.2 | -
PVT-Tiny [103] | 39.8 | 62.2 | 43.0 | 37.4 | 59.3 | 39.9 | 32.9 | -
PVT-Small [103] | 43.0 | 65.3 | 46.9 | 39.9 | 62.5 | 42.8 | 44.1 | -
PVT-Medium [103] | 44.2 | 66.0 | 48.2 | 40.5 | 63.1 | 43.5 | 63.9 | -
PVT-Large [103] | 44.5 | 66.0 | 48.3 | 40.7 | 63.4 | 43.7 | 81.0 | -
ViL-Tiny-RPB [96] | 44.2 | 66.4 | 48.2 | 40.6 | 63.2 | 44.0 | 26.9 | 199
ViL-Small-RPB [96] | 47.1 | 68.7 | 51.5 | 42.7 | 65.9 | 46.2 | 45.0 | 277
ViL-Medium-RPB [96] | 48.9 | 70.3 | 54.0 | 44.2 | 67.9 | 47.7 | 60.1 | 352
ViL-Base-RPB [96] | 49.6 | 70.7 | 54.6 | 44.5 | 68.3 | 48.0 | 76.1 | 439
Focal-Tiny [97] | 47.2 | 69.4 | 51.9 | 42.7 | 66.5 | 45.9 | 48.8 | 291
Focal-Small [97] | 48.8 | 70.5 | 53.6 | 43.8 | 67.7 | 47.2 | 71.2 | 401
Focal-Base [97] | 49.0 | 70.1 | 53.6 | 43.7 | 67.6 | 47.0 | 110.0 | 533
Mamba:
VMamba-T [27] | 48.5 | 69.9 | 52.9 | 43.2 | 66.8 | 46.3 | 42 | 262
VMamba-S [27] | 49.7 | 70.4 | 54.2 | 44.0 | 67.6 | 47.3 | 64 | 357
LocalVMamba-T [29] | 48.7 | 70.1 | 53.0 | 43.4 | 67.0 | 46.4 | 45 | 291
LocalVMamba-S [29] | 49.9 | 70.5 | 54.4 | 44.1 | 67.8 | 47.4 | 69 | 414
From Table 5, it is evident that the characteristics of the three models become more pronounced under the Mask R-CNN 3 × M S schedule. In object detection and instance segmentation tasks, CNN models, while suitable for resource-constrained environments, are somewhat less accurate. Transformer models are suited for high-performance computing environments, demanding significant computational resources but achieving excellent performance. Mamba models balance performance and computational complexity well, providing outstanding performance with relatively lower computational resource requirements.

3.4.3. Analysis and Comparison in Semantic Segmentation Tasks

The ADE20K dataset was used for semantic segmentation tasks. As shown in Table 6, we used the mIoU metric to compare and analyze the performance of various models on this dataset. mIoU, or mean intersection over union, is the standard metric for measuring model performance in segmentation tasks, reflecting the model’s accuracy in handling complex scenes and objects of different scales.
Table 6. Comparison of Mamba models and other state-of-the-art models on the ADE20K dataset.

Backbone | Image Size | Params (M) | FLOPs (G) | mIoU (SS) | mIoU (MS)
CNN:
ResNet-50 [98] | 512² | 67 | 953 | 42.1 | 42.8
ResNet-101 [98] | 512² | 85 | 1030 | 42.9 | 44.0
ConvNeXt-T [104] | 512² | 60 | 939 | 46.0 | 46.7
ConvNeXt-S [104] | 512² | 82 | 1027 | 48.7 | 49.6
ConvNeXt-B [104] | 512² | 122 | 1170 | 49.1 | 49.9
Transformer:
Swin-T [91] | 512² | 60 | 945 | 44.4 | 45.8
Swin-S [91] | 512² | 81 | 1039 | 47.6 | 49.5
Swin-B [91] | 512² | 121 | 1188 | 48.1 | 49.7
Focal-T [97] | 512² | 62 | 998 | 45.8 | 47.0
Focal-S [97] | 512² | 85 | 1130 | 48.0 | 50.0
Focal-B [97] | 512² | 126 | 1354 | 49.0 | 50.5
DeiT-S + MLN [105] | 512² | 58 | 1217 | 43.8 | 45.1
DeiT-B + MLN [105] | 512² | 144 | 2007 | 45.5 | 47.2
Mamba:
Vim-Ti [26] | 512² | 13 | - | 41.0 | -
Vim-S [26] | 512² | 46 | - | 44.9 | -
VMamba-T [27] | 512² | 55 | 939 | 47.3 | 48.3
VMamba-S [27] | 512² | 76 | 1037 | 49.5 | 50.5
VMamba-B [27] | 512² | 110 | 1167 | 50.0 | 51.3
VMamba-S [27] | 640² | 76 | 1620 | 50.8 | 50.8
PlainMamba-L1 [28] | 640² | 35 | 174 | 44.1 | -
PlainMamba-L2 [28] | 640² | 55 | 285 | 46.8 | -
PlainMamba-L3 [28] | 640² | 81 | 419 | 49.1 | -
LocalVim-T [29] | 512² | 36 | 181 | 43.4 | 44.4
LocalVim-S [29] | 512² | 58 | 297 | 46.4 | 47.5
LocalVMamba-T [29] | 512² | 57 | 970 | 47.9 | 49.1
LocalVMamba-S [29] | 512² | 81 | 1095 | 50.0 | 51.0
EfficientVMamba-T [30] | 512² | 14 | 230 | 38.9 | 39.3
EfficientVMamba-S [30] | 512² | 29 | 505 | 41.5 | 42.1
EfficientVMamba-B [30] | 512² | 65 | 930 | 46.5 | 47.3
SiMBA-S [39] | 512² | 62 | 1040 | 49.0 | 49.6
In semantic segmentation tasks, CNN models exhibit relatively stable mIoU performance under both single-scale (SS) and multi-scale (MS) settings. ConvNeXt [104] models perform slightly better than ResNet [98] models, with ConvNeXt-B achieving a mIoU (SS) of 49.1 and a mIoU (MS) of 49.9. Compared to other models, CNNs offer stable and reliable performance, making them suitable for resource-limited applications, though they may fall short in tasks requiring high precision.
Among Transformer models, Swin-B [91] stands out on the ADE20K dataset, with a mIoU (SS) of 48.1 and a mIoU (MS) of 49.7. Focal [97] models also perform well, particularly Focal-B, which achieved mIoU (SS) and mIoU (MS) scores of 49.0 and 50.5, respectively. However, Transformer models generally demand significant computational resources. For instance, while delivering excellent performance, Focal-B required 126M parameters and 1354G FLOPs. Similarly, DeiT-B + MLN [105] required 144M parameters and 2007G FLOPs.
In the Mamba model category, the VMamba [27] backbones showed outstanding performance under both single-scale and multi-scale tests, particularly VMamba-S (640²), which achieved a mIoU (SS) and mIoU (MS) of 50.8. The PlainMamba [28] and LocalVim [29] backbones also provide high mIoU scores with lower parameters and FLOPs, such as PlainMamba-L3, which achieved a mIoU (SS) of 49.1.
Therefore, while both Transformer and Mamba models perform well in semantic segmentation tasks, Transformer models require high computational resources, making them suitable for high-performance computing environments. Mamba models, on the other hand, offer a diverse range of options, including the high-performance VMamba [27] backbones and the low-complexity yet high-performance LocalVim [29] backbones, catering to various computational resource scenarios. Mamba models balance performance and computational complexity well, making them ideal for applications requiring high precision but with limited computational resources.

4. Visual Mamba in Application Fields

Mamba-based modules increase the efficiency of processing sequential data, adeptly capturing long-range dependencies and seamlessly integrating into existing systems. In medical visual tasks and remote sensing images, where inputs usually entail high-resolution data, Mamba emerges as a pivotal tool for augmenting various visual tasks, especially those pertinent to medical applications.
In this section, we begin by highlighting the contributions of Mamba-based modules to enhancing general visual tasks. We then delve into their specific impact on medical visual tasks and remote sensing imagery.

4.1. General Visual Mamba

General vision-related tasks are categorized into high/mid-level vision and low-level vision. High/mid-level vision includes recognition tasks for different input formats (pictures, videos, and 3D representations), such as segmentation, object detection, classification, and prediction. By contrast, low-level vision includes restoration, generation, etc., as shown in Table 7.

4.1.1. High/Mid-Level Vision

The visual Mamba backbones [26,27,28,29,38] have shown decent performance in classification, object detection, and segmentation. SSM-ViT [79] was designed for object detection using event cameras. Unlike standard frame-based cameras, event cameras record per-pixel relative brightness changes in a scene as they occur. Therefore, object detection with event cameras requires processing an asynchronous stream of events in a four-dimensional spatio-temporal space. Earlier studies used RNN architectures with convolutional or attention mechanisms to develop models exhibiting superior performance on downstream tasks using event camera data. However, these models usually suffer from slow training. As a response, the SSM-ViT block was introduced, leveraging an SSM for efficient event-based information processing. It explores two strategies to mitigate aliasing effects when deploying the model at higher frequencies.
For referring image segmentation (RIS), a difficult problem in the field of multi-modal comprehension, ReMamber [50] was introduced, utilizing Mamba’s notable advances in efficient training and inference with linear complexity. Distinguished from conventional segmentation, RIS entails identifying and segmenting specific objects in images according to textual descriptions. The ReMamber architecture comprises several Mamba Twister blocks, each featuring multiple VSS blocks and a Twisting layer. The Mamba Twister block is a multi-modal feature fusion block that blends textual and visual features into a single output, which is the fused multi-modal feature representation. The final segmentation mask is generated by retrieving intermediate features after every Mamba Twister block and feeding them into a flexible decoder. The VSS layers are tasked with extracting visual features, while the Twisting layer primarily captures effective visual-language interactions. The experiments conducted by the authors on various RIS datasets produced state-of-the-art outcomes. Sigma [49] presented a novel network tailored for multimodal semantic segmentation tasks, featuring a channel-aware Mamba decoder and an attention-based Mamba fusion mechanism. During the decoding phase, the fused features undergo further enhancement through channel-aware VSS (CVSS) blocks, adept at capturing multi-scale long-range information and facilitating cross-modal information integration.
Unlike transformers, which depend on quadratic-complexity attention mechanisms, Mamba, as a pure SSM-based model, excels at handling long sequences with linear complexity and is particularly adept at processing lengthy videos at high resolutions. ViS4mer [80] is a model primarily used for recognizing and classifying long videos, especially for understanding and categorizing lengthy movie clips. It is composed of two primary parts: a standard transformer encoder for short-range spatiotemporal feature extraction from video frames, and a multi-scale temporal S4 decoder for subsequent long-range temporal reasoning. The multi-scale temporal S4 decoder is based on an SSM and exploits the core SSM's ability to capture long-range correlations in sequential data, thereby reducing the computational cost of the model.
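As a rough illustration of this encoder-decoder split, the sketch below pairs a small transformer encoder for per-frame spatial tokens with a toy diagonal SSM layer for long-range temporal reasoning. It is a minimal schematic under assumed sizes, not the released ViS4mer code.
```python
# Minimal sketch: per-frame transformer encoder + SSM-style temporal decoder.
import torch
import torch.nn as nn


class DiagonalSSM(nn.Module):
    """Toy diagonal state space layer: h_t = A*h_{t-1} + B*x_t, y_t = C*h_t."""
    def __init__(self, dim, state=16):
        super().__init__()
        self.log_a = nn.Parameter(torch.randn(dim, state) * 0.1)
        self.b = nn.Parameter(torch.randn(dim, state) * 0.1)
        self.c = nn.Parameter(torch.randn(dim, state) * 0.1)

    def forward(self, x):                      # x: (batch, time, dim)
        a_bar = torch.exp(-torch.exp(self.log_a))            # stable discretised transition
        h = torch.zeros(x.size(0), x.size(2), self.b.size(1), device=x.device)
        ys = []
        for t in range(x.size(1)):
            h = a_bar * h + self.b * x[:, t].unsqueeze(-1)    # (B, dim, state)
            ys.append((h * self.c).sum(-1))                   # (B, dim)
        return torch.stack(ys, dim=1)


class LongVideoClassifier(nn.Module):
    def __init__(self, dim=256, num_classes=10):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.spatial_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.temporal_ssm = DiagonalSSM(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, frames):                 # frames: (B, T, patches, dim)
        b, t, p, d = frames.shape
        tokens = self.spatial_encoder(frames.reshape(b * t, p, d))
        frame_feats = tokens.mean(dim=1).reshape(b, t, d)     # one token per frame
        temporal = self.temporal_ssm(frame_feats)             # long-range reasoning
        return self.head(temporal.mean(dim=1))


logits = LongVideoClassifier()(torch.randn(2, 64, 49, 256))   # 64-frame clip
print(logits.shape)                                           # torch.Size([2, 10])
```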
The Video Mamba Suite [43] is not a novel method; rather, it investigates and evaluates the potential of SSMs, embodied by Mamba, in video understanding tasks. Its decomposed bidirectional Mamba (DBM) block is an improved version of the ViM block that shares the SSM parameters in both scanning directions, while keeping the input projectors separate. The authors classify Mamba into four distinct roles for modeling videos and compare it with existing transformer-based models to evaluate its effectiveness in various video understanding tasks. The 14 models and modules that make up the Video Mamba Suite were used to assess performance on 12 different video comprehension tasks, and the experiments showed that Mamba is applicable to video analysis and can be used for more complex, multimodal video understanding challenges. Apart from the Video Mamba Suite, VideoMamba [33] was proposed for video understanding tasks, with a specific focus on addressing two major challenges: local redundancy and global dependencies. The study evaluated VideoMamba's capabilities across four key aspects: scalability in the video domain, sensitivity to short-term action recognition, advantages in long-term video understanding, and compatibility with other modalities. To enhance model scalability in the visual domain, VideoMamba uses a self-distillation strategy, which significantly improves performance as both the model and input sizes increase, without the need for pretraining on large-scale datasets. While the ViM block enhances the model's spatial perception capabilities, VideoMamba extends this capability to 3D video understanding by including spatio-temporal bidirectional scanning. By extending the ViM block, VideoMamba achieves a significant increase in processing speed and a reduction in computational resource consumption without compromising performance. SpikeMba [93] presents a pioneering multimodal video content understanding framework geared towards temporal video grounding. The proposed framework combines spiking neural networks (SNNs) with SSM blocks in order to discern intricate relationships within multimodal input features. The spike saliency detector (SSD) leverages the thresholding mechanism of SNNs to generate sets of saliency proposals, denoting highly pertinent or salient instances in a video via spikes. Furthermore, based on the SSM, the multimodal relevance Mamba block (MRM) retains linear complexity with respect to input size, while enhancing long-range dependency modeling.
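The following minimal sketch illustrates the idea of spatio-temporal bidirectional scanning: video patch tokens are flattened along (T, H, W), scanned forward and backward with a shared gated recurrence standing in for a Mamba block, and the two passes are merged. All module names and dimensions are assumptions for illustration.
```python
# A minimal sketch (an assumption, not the VideoMamba release) of bidirectional
# spatio-temporal scanning over flattened video patch tokens.
import torch
import torch.nn as nn


class GatedScan(nn.Module):
    """Simple selective recurrence used here as a stand-in for a Mamba block."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(dim, dim)
        self.inp = nn.Linear(dim, dim)

    def forward(self, x):                        # x: (B, L, D)
        h = torch.zeros(x.size(0), x.size(2), device=x.device)
        out = []
        for t in range(x.size(1)):
            g = torch.sigmoid(self.gate(x[:, t]))            # input-dependent gate
            h = g * h + (1 - g) * torch.tanh(self.inp(x[:, t]))
            out.append(h)
        return torch.stack(out, dim=1)


class BidirectionalSpatioTemporalBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.scan = GatedScan(dim)

    def forward(self, video_tokens):             # (B, T, H, W, D)
        b, t, h, w, d = video_tokens.shape
        seq = self.norm(video_tokens.reshape(b, t * h * w, d))   # spatio-temporal flatten
        fwd = self.scan(seq)                                      # forward scan
        bwd = self.scan(seq.flip(1)).flip(1)                      # backward scan
        return (fwd + bwd).reshape(b, t, h, w, d) + video_tokens


block = BidirectionalSpatioTemporalBlock(dim=64)
print(block(torch.randn(1, 8, 14, 14, 64)).shape)                # (1, 8, 14, 14, 64)
```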
Multimodal large language models (MLLMs), built on transformers, have demonstrated significant success across diverse domains, albeit with quadratic computational complexity. To improve the efficiency of these models, Mamba has been incorporated into this line of work. To produce an effective multimodal representation, Cobra [106] integrates an efficient Mamba language model with the visual modality and investigates several modal fusion strategies. It comprises three parts: a visual encoder, a projector, and a Mamba backbone. The visual encoder extracts the visual representation of the image, while the projector adjusts the dimensions of the visual representation to align with the Mamba language model's tokens. The Mamba backbone consists of 64 identical basic blocks with residual connections and RMSNorm, and it transforms the combined visual and textual embeddings into target token sequences in an autoregressive manner. VL-Mamba [42] comprises a pretrained visual encoder, a randomly initialized multimodal connector (MMC), and a pretrained Mamba LLM. The visual encoder takes the original image and uses the ViT architecture to create a series of patch features. The MMC introduces a 2D visual selective scanning mechanism tailored for computer vision tasks, since the state-space model is designed for 1D causal sequential data, whereas the visual sequences from the visual encoder are 2D and non-causal. The study explored three multimodal connector variants: MLP, VSS-MLP, and VSS-L2. Input images are first processed into visual features by the encoder; these visual sequences are fed to the MMC, and the resulting output vectors are combined with a tokenized text question and sent to the Mamba LLM, which produces the appropriate response. Through the synergistic combination of these components, the integration and processing of visual and verbal information are optimized.
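A schematic of the shared encoder-projector-backbone pipeline is sketched below; the patch encoder, projector width, and the GRU standing in for the Mamba language backbone are illustrative assumptions rather than the released Cobra or VL-Mamba models.
```python
# Schematic encoder -> projector -> language-backbone pipeline; all sizes are assumptions.
import torch
import torch.nn as nn


class TinyMultimodalLM(nn.Module):
    def __init__(self, vocab=1000, vis_dim=192, lm_dim=256):
        super().__init__()
        # Vision encoder: patchify the image and embed each patch.
        self.patch_embed = nn.Conv2d(3, vis_dim, kernel_size=16, stride=16)
        # Projector: align visual features with the language model width.
        self.projector = nn.Sequential(nn.Linear(vis_dim, lm_dim), nn.GELU(),
                                       nn.Linear(lm_dim, lm_dim))
        self.token_embed = nn.Embedding(vocab, lm_dim)
        # Recurrent stand-in for the linear-time Mamba backbone.
        self.backbone = nn.GRU(lm_dim, lm_dim, num_layers=2, batch_first=True)
        self.lm_head = nn.Linear(lm_dim, vocab)

    def forward(self, image, text_ids):
        patches = self.patch_embed(image).flatten(2).transpose(1, 2)  # (B, N, vis_dim)
        vis_tokens = self.projector(patches)                          # (B, N, lm_dim)
        txt_tokens = self.token_embed(text_ids)                       # (B, L, lm_dim)
        seq = torch.cat([vis_tokens, txt_tokens], dim=1)              # prepend vision tokens
        hidden, _ = self.backbone(seq)
        return self.lm_head(hidden[:, vis_tokens.size(1):])           # logits for the text part


model = TinyMultimodalLM()
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 1000, (2, 12)))
print(logits.shape)                                                   # (2, 12, 1000)
```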
To tackle the challenging problem of predicting temporal and spatial dynamics for spatio-temporal forecasting in videos, the VMRNN cell [47] was introduced as a novel recurrent unit designed to handle spatio-temporal prediction tasks efficiently. Recognizing the difficulty of processing extensive global information, the VMRNN cell integrates VSS blocks with an LSTM architecture to leverage the long-sequence modeling abilities of VSS blocks and the spatio-temporal representation capabilities of LSTM; this integration enhances both the accuracy and the efficiency of spatio-temporal predictions. The model performs image-level analysis by segmenting each frame into patches, which are subsequently flattened and processed through an embedding layer, enabling the VMRNN layer to extract and predict spatio-temporal features effectively. HARMamba [40] builds on bidirectional selective SSM blocks for wearable-sensor activity recognition and achieves superior performance while reducing computational and memory overhead in activity recognition tasks.
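The hedged sketch below illustrates how a sequence-mixing (VSS-style) block can be coupled with LSTM gating in a recurrent cell of this kind; the patch count, widths, and the simple mixer used in place of a full VSS block are assumptions for illustration only.
```python
# Minimal sketch of a VMRNN-style cell: token mixing per frame + LSTM gating over time.
import torch
import torch.nn as nn


class TokenMixer(nn.Module):
    """Lightweight stand-in for a VSS block: norm + linear-time token mixing + MLP."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mix = nn.GRU(dim, dim, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, tokens):                           # (B, N, D)
        mixed, _ = self.mix(self.norm(tokens))
        return tokens + self.mlp(mixed)


class VMRNNStyleCell(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.mixer = TokenMixer(dim)
        self.gates = nn.Linear(2 * dim, 4 * dim)         # i, f, g, o gates

    def forward(self, x_tokens, state):                  # x_tokens: (B, N, D)
        h, c = state                                     # each (B, N, D)
        z = self.mixer(x_tokens)
        i, f, g, o = self.gates(torch.cat([z, h], dim=-1)).chunk(4, dim=-1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, (h, c)


cell, dim = VMRNNStyleCell(64), 64
h = c = torch.zeros(2, 196, dim)                         # 14x14 patches per frame
for frame_tokens in torch.randn(10, 2, 196, dim):        # 10-step input sequence
    out, (h, c) = cell(frame_tokens, (h, c))
print(out.shape)                                         # (2, 196, 64)
```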

4.1.2. Low-Level Vision

In the realm of image super-resolution, meet more areas (MMA) [41] stands out as a novel model designed for super-resolution tasks. By building on the ViM block, MMA aims to enhance performance by activating a wider range of areas within images. To this end, MMA adopts several key strategies, including integrating ViM blocks into MetaFormer-style modules, pre-training ViM on larger datasets, and employing complementary attention mechanisms. MMA comprises three primary modules: shallow feature extraction, deep feature extraction, and high-quality reconstruction. By leveraging the ViM module, MMA effectively models global information and further expands the activation region through attention mechanisms.
Existing restoration backbones often face a dilemma between global receptive fields and efficient computation, hindering their practical application, whereas Mamba, with its linear-complexity long-range dependency modeling, offers an efficient way to resolve this dilemma. MambaIR [57] addresses the problem by introducing local enhancement and channel attention mechanisms to enhance the standard Mamba model. Its methodology mainly consists of three stages: shallow feature extraction, deep feature extraction, and high-quality image reconstruction. The deep feature extraction stage utilizes multiple residual state space blocks (RSSBs), each adding a VSS block before the channel attention block designed in previous transformer-based restoration networks. Serpent [58] has a hierarchical architecture that processes input images in a multi-scale manner, with steps including segmentation, embedding, downsampling, and upsampling, and introduces skip connections to facilitate information flow. The Serpent block is the main processing unit, consisting of multiple stacked VSS blocks. Serpent significantly reduces the computational effort, GPU memory demand, and model size while preserving good reconstruction quality, by combining the benefits of transformers and convolutional networks. VmambaIR [32] put forward the omni selective scan (OSS) module to comprehensively and efficiently model image features from six directions. Its omnidirectional selective scanning mechanism overcomes the unidirectional modeling limitation of SSMs and accomplishes thorough pattern identification and modeling by simulating the three-dimensional flow of visual information.
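As a concrete, hedged illustration of a residual block that pairs a sequence-mixing layer with local enhancement and channel attention, in the spirit of the RSSB described above, consider the sketch below; the simplified mixer and layer sizes are assumptions rather than the released MambaIR code.
```python
# Hedged sketch of a residual "state space" block with channel attention.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),
                                nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())

    def forward(self, x):                      # x: (B, C, H, W)
        return x * self.fc(x)


class ResidualStateSpaceBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.mixer = nn.GRU(channels, channels, batch_first=True)  # stand-in for the VSS scan
        self.local = nn.Conv2d(channels, channels, 3, padding=1)   # local enhancement
        self.ca = ChannelAttention(channels)

    def forward(self, x):                      # x: (B, C, H, W)
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)                          # (B, H*W, C)
        mixed, _ = self.mixer(self.norm(seq))
        y = mixed.transpose(1, 2).reshape(b, c, h, w)
        y = self.ca(self.local(y))                                  # local features + channel attention
        return x + y                                                # residual connection


block = ResidualStateSpaceBlock(48)
print(block(torch.randn(1, 48, 32, 32)).shape)                      # (1, 48, 32, 32)
```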
UVM-Net [85] is a novel single-image dehazing network architecture that achieves strong performance by merging the long-range dependency modeling capacity of SSMs with the local feature extraction of convolutional layers. The method employs an encoder–decoder network architecture whose critical component is the ViM block, which leverages the long-range modeling capability of SSMs by rolling the feature map over the channel domain. Unlike U-Mamba [84] and Mamba-UNet [51], which capture long-range dependencies over the non-channel (spatial) dimensions, UVM-Net thus applies the ViM block along a different feature map dimension.
Images lose important frequency information under the influence of raindrops, degrading visual perception and downstream high-level visual tasks. FreqMamba [68] is a novel image de-raining method combining Mamba modeling with frequency analysis techniques to address the image de-raining problem. Specifically, FreqMamba contains three branches: spatial Mamba, frequency band Mamba, and Fourier global modeling. Spatial Mamba processes raw image features to extract details and correlations within the image. Frequency band Mamba employs the wavelet packet transform (WPT) to decompose the input features into spectral features in different frequency bands and scans them over the frequency dimension. Fourier modeling, i.e., processing the input with a Fourier transform, captures the global degradation patterns that affect an image. Extensive experiments showed that FreqMamba outperforms current state-of-the-art techniques both visually and quantitatively.
Image deblurring is a classic problem in low-level computer vision, with the goal of restoring crisp, high-quality images from blurry input photographs. ALGNet [107] is an efficient image deblurring network utilizing selective state-space models (SSMs) to aggregate rich and accurate features. The network consists of multiple ALGBlocks, each of which contains a CLGF module that captures local and global features and a feature aggregation (FA) module. The CLGF module captures long-range dependent features using an SSM and employs a channel-attention mechanism to reduce local pixel forgetting and channel redundancy. Through weight calibration, the FA module highlights the significance of local features in recovery.
The efficiency of Mamba makes a significant contribution to mitigating the high computational complexity associated with training generation tasks. To address the challenge of generating long and diverse sequences with low latency, MambaTalk [74] implements a two-stage modeling strategy with discrete motion priors to improve the quality of gestures and employs a Mamba block to enhance gesture diversity and rhythm through multimodal integration. Motion Mamba [34] constructs a motion generation model based on Mamba, leveraging an efficient hardware-aware design. It consists of two main components: a hierarchical temporal Mamba (HTM) block for handling temporal data and a bidirectional spatial Mamba (BSM) block for analyzing latent poses. To maintain motion consistency across frames, the HTM block employs several separate SSM modules within a balanced U-Net architecture, while the BSM block enhances the accuracy of motion generation within a temporal frame by processing latent poses bidirectionally. Diffusion state space models (DiS) [88] substitute the conventional U-Net backbone in diffusion models with an SSM, which treats all inputs, including time, condition, and noisy image patches, as tokens. To address the oversight of spatial continuity in the scanning scheme of existing Mamba-based vision methods, Zigzag Mamba [31] was introduced as a straightforward, plug-and-play solution inspired by DiT-style approaches. Essentially, it retains the scanning scheme of plain Mamba but expands it from four to eight schemes by incorporating mirror-flipping schemes, as displayed in Figure 4f. Zigzag Mamba was then integrated with the stochastic interpolant framework, forming ZigMa, to investigate the diffusion model's scalability on high-resolution visual datasets. Gamba [111] introduces a sequential network based on Mamba, allowing for linear scalability with sequence length and context-dependent reasoning; this architecture accommodates a large number of Gaussians for the 3D Gaussian splatting process. To deal with the quadratic growth of memory consumption with sequence length in traditional attention-based video generative diffusion models, an SSM-based diffusion model [89] was introduced for generating longer video sequences. Similarly to ViS4mer [80], the SSM-based diffusion model re-imagines the attention modules within the conventional temporal layers of video diffusion models, replacing them with a ViM block that captures the temporal dynamics of video data and an MLP that improves model performance. This approach significantly mitigates memory consumption for extended sequences.
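The idea of expanding continuity-preserving scan orders with mirror flips can be illustrated with a few lines of index bookkeeping; the sketch below builds zigzag (boustrophedon) row and column scans of an H x W token grid and doubles them via mirror flips, yielding eight orderings. It illustrates the principle only and is not the ZigMa implementation.
```python
# Illustrative construction of eight zigzag scan orders for a 2D token grid.
import numpy as np


def zigzag(grid, by_rows=True):
    """Boustrophedon traversal: reverse direction on every other row (or column)."""
    if not by_rows:
        grid = grid.T
    lines = [line if i % 2 == 0 else line[::-1] for i, line in enumerate(grid)]
    return np.concatenate(lines)


def eight_scan_orders(h, w):
    idx = np.arange(h * w).reshape(h, w)              # token indices, row-major layout
    mirrors = [idx, np.fliplr(idx), np.flipud(idx), np.flipud(np.fliplr(idx))]
    return [zigzag(m, by_rows) for m in mirrors for by_rows in (True, False)]


orders = eight_scan_orders(4, 4)
print(len(orders))          # 8 distinct scan orders
print(orders[0])            # [ 0  1  2  3  7  6  5  4  8  9 10 11 15 14 13 12]
for o in orders:            # every ordering is a permutation of all 16 tokens
    assert sorted(o.tolist()) == list(range(16))
```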
In 3D vision tasks, the irregularity and sparsity of point cloud data present considerable hurdles. Transformers exhibit promise for point cloud data owing to their strong global information modeling capability, but their computational complexity escalates significantly as the input length increases, which restricts their applicability, particularly to long sequences. PointMamba [110] employs embedded point patches as inputs and enhances the SSM's capacity for global modeling by reordering them into a more logical geometric scanning order. The rearranged point tokens then undergo processing via multiple Mamba blocks to causally represent the structure of the point cloud, showcasing effectiveness across different point cloud analysis tasks. 3DMambaComplete [83] tackles the computational complexity challenges of point cloud completion by leveraging the Mamba framework. The method involves downsampling incomplete point clouds, enhancing feature learning with a Mamba encoder, predicting and refining hyperpoints, dispersing hyperpoints to various 3D locations through learned offsets, and performing point deformation to generate complete point clouds. Structured state-space modeling optimizes shape reconstruction by predicting hyperpoints and controlling the deformation at each hyperpoint location. 3DMambaIPF [108] concentrates on denoising large-scale point cloud data: by integrating Mamba into a filtering module, Mamba-Denoise, it enables accurate and fast modeling of long sequences of point cloud features. Employing iterative point cloud filtering and loss functions including a reconstruction loss and a differentiable rendering loss, it minimizes the distance between noisy and real point clouds, optimizing visual boundaries and enhancing denoising realism. Point Cloud Mamba [109] combines local and global modeling frameworks and introduces a consistent traversal serialization (CTS) approach to convert 3D point cloud data into 1D point sequences while preserving spatial adjacency. Moreover, it incorporates point cues and position encoding based on spatial coordinate mapping to improve Mamba's efficiency in processing point sequences and injecting position information. Point Mamba [44] addresses causality in SSMs for point cloud data by introducing an octree-based ordering strategy. Additionally, it integrates bi-directional selective scanning mechanisms into its point Mamba blocks to adjust sequence order dependency, thereby enhancing its adaptability to point cloud structures.
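A common ingredient of these point cloud models is serialising an unordered point set into a 1D sequence that roughly preserves spatial adjacency. The sketch below does this with a Morton (Z-order) code on a voxelised grid, which is a deliberately simple stand-in for the octree- and CTS-based orderings used by the papers above.
```python
# Minimal space-filling-curve serialisation of a point cloud (illustrative only).
import numpy as np


def morton_code(ix, iy, iz, bits=10):
    """Interleave the bits of three integer coordinates into one Z-order key."""
    code = np.zeros_like(ix)
    for b in range(bits):
        code |= ((ix >> b) & 1) << (3 * b)
        code |= ((iy >> b) & 1) << (3 * b + 1)
        code |= ((iz >> b) & 1) << (3 * b + 2)
    return code


def serialize_points(points, bits=10):
    """Return the permutation that orders points along the Z-order curve."""
    mins, maxs = points.min(0), points.max(0)
    grid = ((points - mins) / (maxs - mins + 1e-9) * (2**bits - 1)).astype(np.int64)
    return np.argsort(morton_code(grid[:, 0], grid[:, 1], grid[:, 2], bits))


pts = np.random.rand(2048, 3).astype(np.float32)
sequence = pts[serialize_points(pts)]   # 1D token sequence fed to the Mamba blocks
print(sequence.shape)                   # (2048, 3)
```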

4.2. Medical Visual Mamba

Transformers [8] have exerted a deep influence on the field of medical imaging with their ability to master complex data representations. They have led to notable advancements across various imaging modalities, including radiography [112], endoscopy [113], computed tomography (CT) [114], ultrasound images [115], and magnetic resonance imaging (MRI) [116]. However, since most medical images are high-resolution and detailed, transformer models typically require considerable computational resources, which scale quadratically with image resolution.
Recently, the medical imaging field has experienced a surge in the development of Mamba-based methodologies, particularly following the introduction of VMamba. The current section provides detailed examples of these design choices, dividing them into 2D- and 3D-based approaches according to the input type, as displayed in Table 8.

4.2.1. Two-Dimensional Medical Images

Mamba has exhibited impressive potential in 2D medical segmentation, as displayed in Figure 6. Here, we discuss in detail some methods that explore the use of Mamba to model global structural information in 2D medical image segmentation.
Most of the innovative architectures that have been developed are based on U-Net, which has shown outstanding results in a variety of medical image segmentation challenges. U-Mamba [84] is the first extension of the Mamba model to the U-Net framework for visual segmentation in biomedical imaging, addressing the challenge of long-range dependency modeling with a hybrid CNN-SSM block. Wu et al. introduced the High-order Vision Mamba UNet (H-vmunet) [55], an improvement on U-Mamba, which employs a high-order 2D-selective scan at each interaction to bolster the learning of local features, while minimizing the incorporation of redundant information. Shortly after the initial release, the team expanded their work by introducing the UltraLight VM-UNet [62]. This new iteration was developed through an in-depth analysis of the critical factors affecting parameter efficiency within the Mamba framework, resulting in a significantly more lightweight model with a mere 0.049 M parameters and a computational cost of only 0.060 GFLOPs. Moreover, Mamba-UNet [51] combines the encoder–decoder architecture of U-Net with the capabilities of Mamba and maintains spatial information at different network scales through skip connections. It uses a visual Mamba-based VSS block, which utilizes linear embedding layers and depthwise convolution to extract features, while downsampling and upsampling are facilitated by multiple merge operations and expansion layers for comprehensive feature learning.
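The following compact sketch shows the hybrid pattern shared by these models, namely convolutional stages for local features, a linear-time token mixer standing in for the SSM for global context, and skip connections between encoder and decoder; depths, widths, and the GRU mixer are illustrative assumptions, not any released implementation.
```python
# Compact, hedged sketch of a hybrid CNN + sequence-mixing U-Net.
import torch
import torch.nn as nn


class ConvSSMBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.GELU())
        self.norm = nn.LayerNorm(ch)
        self.mixer = nn.GRU(ch, ch, batch_first=True)        # stand-in for the SSM scan

    def forward(self, x):                                    # (B, C, H, W)
        x = x + self.conv(x)                                  # local features
        b, c, h, w = x.shape
        seq = self.norm(x.flatten(2).transpose(1, 2))         # (B, H*W, C)
        mixed, _ = self.mixer(seq)                            # global, linear-time mixing
        return x + mixed.transpose(1, 2).reshape(b, c, h, w)


class TinyMambaUNet(nn.Module):
    def __init__(self, in_ch=1, base=16, classes=2):
        super().__init__()
        self.stem = nn.Conv2d(in_ch, base, 3, padding=1)
        self.enc1, self.enc2 = ConvSSMBlock(base), ConvSSMBlock(base * 2)
        self.down = nn.Conv2d(base, base * 2, 2, stride=2)
        self.up = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.fuse = nn.Conv2d(base * 2, base, 1)
        self.dec = ConvSSMBlock(base)
        self.head = nn.Conv2d(base, classes, 1)

    def forward(self, x):
        s1 = self.enc1(self.stem(x))                          # encoder stage 1
        s2 = self.enc2(self.down(s1))                         # encoder stage 2
        up = self.up(s2)
        d = self.dec(self.fuse(torch.cat([up, s1], dim=1)))   # skip connection + decoder
        return self.head(d)


net = TinyMambaUNet()
print(net(torch.randn(1, 1, 64, 64)).shape)                   # (1, 2, 64, 64)
```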
Pyramid ViT (PVT) and Swin-UNet are pioneering hierarchical designs for visual tasks, introducing progressive shrinking pyramids and spatial-reduction attention. Drawing inspiration from PVT and Swin-UNet, Ruan et al. introduced VM-UNet [63], a foundational model for purely SSM-based segmentation in medical imaging. This model demonstrates the capabilities of SSMs in medical image segmentation and is made up of three primary components: an encoder, a decoder, and skip connections. Building on this work, VM-UNET-V2 [64] was proposed, in which the visual state space (VSS) blocks capture a greater range of contextual information and a semantics and detail infusion (SDI) mechanism enhances the fusion of low-level and high-level features. Mamba-HUNet [59], another multi-scale hierarchical upsampling network, incorporates the Mamba technique and preserves spatial information by extracting hierarchical characteristics through patch merging layers and visual state space blocks. TM-UNet [60] introduced improvements to the bottleneck layer, proposing a triplet SSM as the bottleneck and marking the first attempt to combine spatial and channel information using a pure SSM technique. Because most current Mamba-based models are developed from scratch, they miss out on the possible advantages of pretraining; Swin-UMamba [61], a Mamba-based model tailored explicitly for medical image segmentation tasks, instead leverages the strengths of ImageNet-based pretraining.
Previous discussions have primarily concentrated on supervised learning methods, but other supervisory approaches have also been explored. Semi-Mamba-UNet [52] combines a visual Mamba-based U-shaped encoder–decoder with a traditional CNN-based UNet in a semi-supervised learning framework. To enhance feature learning, especially on unlabeled data, it presents a self-supervised pixel-wise contrastive learning approach that makes use of two projectors. Weak-Mamba-UNet [76] is a novel weakly supervised learning framework for medical image segmentation, combining CNN, ViT, and VMamba architectures. It employs a cooperative, cross-supervisory approach using pseudo labels for iterative network learning and improvement, with a focus on scribble-based annotations.
Some segmentation approaches diverge from UNet architectures. P-Mamba [117] presents a novel dual-branch framework for highly efficient left ventricle segmentation in pediatric echocardiograms. This model features an innovative DWT-based encoder branch equipped with Perona–Malik diffusion (PMD) blocks, and, to bolster computational and memory efficiency, P-Mamba adopts vision Mamba layers within its vision Mamba encoder branch. ProMamba [78] integrates vision Mamba with prompt technologies and marks a significant milestone as the first model to leverage the Mamba framework for the specific task of polyp segmentation.
In addition, Mamba has also expanded into 2D medical imaging research beyond segmentation, enhancing the precision and speed of image analysis to support diagnosis and treatment planning. Classification is a vital and fundamental task in medical image analysis. For this purpose, Yue et al. introduced MedMamba [65], a vision Mamba for medical image classification. They created a brand-new module called Conv-SSM, which combines the long-range dependency capture of SSMs with the local feature extraction of convolutional layers, enabling efficient modeling of medical images from different modalities. Furthermore, MambaMIL [118] introduced the Sequence Reordering Mamba (SR-Mamba), a module aware of the order and distribution of instances in long sequences, to effectively harness valuable embedded information. Since accurate and efficient clinical decision-making depends on high-quality and high-fidelity medical images, image reconstruction plays a critical role in improving diagnostic processes. Huang et al. [56] developed MambaMIR, a model leveraging Mamba for the reconstruction of medical images, alongside its advanced counterpart, MambaMIR-GAN, which incorporates generative adversarial networks. Zheng et al. introduced FD-Vision Mamba (FDVM-Net) [81], a frequency-domain-based network that corrects image exposure by reconstructing the frequency domain of endoscopic images, which often suffer from exposure abnormalities. In more specialized areas, MD-Dose [90], a diffusion model based on the Mamba architecture, was designed to accurately predict radiation therapy dose distributions for thoracic cancer patients.

4.2.2. Three-Dimensional Medical Images

Three-dimensional image analysis in medical imaging enables more accurate and comprehensive diagnoses by providing a detailed view of complex anatomical structures. Gong et al. presented nnMamba [70], an innovative architecture designed for 3D medical imaging applications, which integrates local and global relationship modeling via the MICCSS (Mamba-In-Convolution with Channel-Spatial Siamese input) module. nnMamba was tested on a comprehensive benchmark of six datasets for three crucial tasks, including segmentation, classification, and landmark detection, showcasing its capability for long-range relationship modeling at channel and spatial levels.
Precise 3D segmentation results can alleviate physicians' diagnostic workloads in disease management. SegMamba [71], a cutting-edge architecture, is the first technique to use Mamba expressly for precise 3D medical image segmentation. It introduced a tri-orientated Mamba (ToM) module for modeling 3D features from three directions and a gated spatial convolution (GSC) module to enhance spatial feature representation before each ToM module. Similarly employing a U-shaped architecture, LightM-UNet [86] uses the residual vision Mamba layer alone, in a Mamba-only manner, to model large-scale spatial dependencies and extract deep semantic features within a lightweight framework. Both LMa-UNet [45] and T-Mamba [77] build upon the foundation of SegMamba, with improvements made to the Mamba block. A notable aspect of T-Mamba [77] is a gate selection unit that adaptively combines two spatial-domain features with one frequency-domain feature, marking the first instance of incorporating frequency-based features into the vision Mamba framework, whereas LMa-UNet [45] is distinguished by its use of large windows, which outperform small kernel-based CNNs and small window-based transformers in local spatial modeling. The issue of long-term temporal dependency in video scenarios was also addressed by developing a general framework called Vivim [35], built on Mamba for video vision. Based on a specifically engineered temporal Mamba block, this model effectively compresses long-term spatiotemporal data into sequences of different scales.
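The tri-orientated scanning idea can be sketched as flattening the 3D feature volume along three different axis orders, running a shared linear-time scanner over each sequence, and summing the results; in the illustration below, a GRU stands in for the Mamba layer and all sizes are assumptions.
```python
# Illustrative tri-orientated scanning of a 3D feature volume (not the SegMamba code).
import torch
import torch.nn as nn


class TriOrientedMixer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.scan = nn.GRU(dim, dim, batch_first=True)       # shared sequence mixer

    def _scan_along(self, x, axes):                          # axes: spatial order, e.g. (2, 3, 4)
        b, c = x.shape[:2]
        perm = (0, *axes, 1)                                 # move channels last
        inv = [perm.index(i) for i in range(5)]              # inverse permutation
        y = x.permute(*perm)                                 # e.g. (B, D, H, W, C)
        mixed, _ = self.scan(y.reshape(b, -1, c))            # scan the flattened volume
        return mixed.reshape(*y.shape).permute(*inv)         # back to (B, C, D, H, W)

    def forward(self, x):                                    # x: (B, C, D, H, W)
        directions = [(2, 3, 4), (3, 4, 2), (4, 2, 3)]       # three scan orientations
        return x + sum(self._scan_along(x, a) for a in directions)


mixer = TriOrientedMixer(dim=32)
print(mixer(torch.randn(1, 32, 8, 16, 16)).shape)            # (1, 32, 8, 16, 16)
```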
For image registration tasks, MambaMorph [87] introduced a multi-modality deformable registration framework that enhances medical image analysis by combining a Mamba-based registration module with an advanced feature extractor for efficient spatial correspondence and feature learning. VMambaMorph [53] further enhanced the VMamba-based block by redesigning the 2D cross-scan module to process 3D volumetric features efficiently.
In other domains, the Contrastive Masked Vim Autoencoder (CMViM) [82] tackles Alzheimer's disease (AD) classification by incorporating vision Mamba (ViM) into a masked autoencoder for 3D multi-modal data reconstruction. For endoscopy skill evaluation, a low-cost motion-guided dual-camera tracker [46] provides reliable endoscope tip feedback, in which a Mamba-based motion-guided prediction head (MMH) merges visual tracking with historical motion data based on an SSM.

4.2.3. Challenge

Here, we outline some promising future research directions for vision Mamba in medical image analysis. Key challenges include the need for pretraining on large datasets, enhancing the interpretability of Mamba-based medical imaging approaches, and improving robustness against adversarial attacks. There is also a need to design efficient Mamba architectures suitable for real-time medical applications and to address the challenges of deploying Mamba-based models in distributed settings.

4.3. Remote Sensing Image

The progress of remote sensing methods has sparked interest in high-resolution Earth observation, with the transformer model providing an optimal solution through its attention mechanism. Its quadratic complexity, however, presents problems with memory consumption and modeling efficiency. The SSM addresses these issues by establishing long-distance dependencies with near-linear complexity, and Mamba can further enhance efficiency through hardware optimization and time-varying parameters. Representative recent work is presented in Table 9.
Drawing inspiration from TNT, Chen et al. introduced the Mamba-in-Mamba (MiM-ISTD) [66] architecture to improve the efficiency of infrared small target detection. In this approach, local patches are treated as "visual sentences", and an outer Mamba is utilized to extract global information. Regarding remote sensing image classification, RSMamba [36] features a dynamic multi-path activation mechanism to improve Mamba's capability for handling non-causal data. RS-Mamba [69] is adept at handling very-high-resolution (VHR) remote sensing images for dense prediction tasks, built on an omnidirectional selective scan module that models images comprehensively from various directions. In remote sensing research, classifying hyperspectral images is challenging because of their high-dimensional, complex data. HSIMamba [72] was designed with a module dedicated to spatial analysis that covers multiple spectral bands and three-dimensional spatial structures, taking advantage of the rich multidimensional nature of hyperspectral data and improving feature representation using linear transformations and activation functions. By capturing forward and backward spectral dependencies, HSIMamba adopts a bidirectional processing approach that enhances the network's capacity to represent and exploit spectral information. In addition, Pan-Mamba [73] offers a novel network for pansharpening, introducing two essential components, channel-swapping Mamba and cross-modal Mamba, both of which are skillfully designed for effective cross-modal information fusion and interchange. For the first time, ChangeMamba [54] investigated the Mamba architecture's potential for remote sensing change detection (CD) tasks. For binary change detection (BCD), semantic change detection (SCD), and building damage assessment (BDA), the MambaBCD, MambaSCD, and MambaBDA network frameworks were built, and three spatio-temporal relationship modeling mechanisms were proposed to fully learn spatio-temporal features. ChangeMamba employs selective state-space modeling to capture long-range dependent features and maintains linear computational complexity, while providing a visual Mamba architecture to learn global spatial context information. Semantic segmentation of remotely sensed images is crucial for geoscientific research, and RS3Mamba [67] is a novel two-branch model developed for this purpose. The model incorporates visual state space (VSS) models, particularly the Mamba architecture, to improve long-range relational modeling, and a collaborative completion module (CCM) was proposed for feature fusion. The experimental results demonstrated that RS3Mamba has significant advantages over CNN- and transformer-based approaches. With an encoder–decoder architecture, Samba [75] is a semantic segmentation framework designed especially for high-resolution remote sensing images; its Samba blocks act as the encoder to extract multi-level semantic information, and the Mamba blocks within them adopt SSMs to capture global semantic information with linear computational complexity.
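To illustrate the bidirectional spectral processing described for HSIMamba, the hedged sketch below embeds each band of a hyperspectral pixel patch, scans the band sequence in both directions with a GRU standing in for the SSM layer, and classifies from the concatenated summaries; all sizes and module choices are assumptions.
```python
# Hedged sketch of bidirectional spectral scanning for hyperspectral classification.
import torch
import torch.nn as nn


class BidirectionalSpectralEncoder(nn.Module):
    def __init__(self, spatial_dim, hidden=64, classes=9):
        super().__init__()
        self.embed = nn.Linear(spatial_dim, hidden)          # embed each band's patch
        self.fwd = nn.GRU(hidden, hidden, batch_first=True)  # forward spectral scan
        self.bwd = nn.GRU(hidden, hidden, batch_first=True)  # backward spectral scan
        self.head = nn.Linear(2 * hidden, classes)

    def forward(self, cube):                  # cube: (B, bands, patch_h, patch_w)
        b, bands = cube.shape[:2]
        seq = self.embed(cube.reshape(b, bands, -1))          # (B, bands, hidden)
        f, _ = self.fwd(seq)                                  # low -> high wavelengths
        r, _ = self.bwd(seq.flip(1))                          # high -> low wavelengths
        feat = torch.cat([f[:, -1], r[:, -1]], dim=-1)        # summarise both directions
        return self.head(feat)


model = BidirectionalSpectralEncoder(spatial_dim=7 * 7)
logits = model(torch.randn(4, 200, 7, 7))                     # 200 spectral bands
print(logits.shape)                                           # (4, 9)
```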

5. Conclusions

Mamba is gaining prominence in computer vision for its capability to manage long-range dependencies and its significant computational efficiency relative to transformers. As detailed throughout this survey, various methods have been developed to harness and investigate Mamba's capabilities, reflecting ongoing advancements in the field.
We began by discussing the foundational concepts of SSM and Mamba architectures, followed by a comprehensive analysis of competing methodologies across a spectrum of computer vision applications. Our survey encompassed state-of-the-art Mamba models designed for backbone architectures, high/mid-level vision, low-level vision, medical imaging, and remote sensing. Moreover, this survey is the first review paper on recent developments in SSMs and Mamba-based techniques that concentrates explicitly on computer vision challenges. Our goal was to generate more interest within the vision community in exploring the possibilities of Mamba models and finding solutions to their current limitations.

5.1. Challenges and Limitations

Currently, Mamba has some limitations that vision Mamba aims to address. The original Mamba's one-dimensional selective scanning struggles to capture spatial information in high-dimensional visual data. While existing methods attempt to mitigate this issue with enhanced scanning mechanisms, they remain insufficient and require further exploration to effectively retain spatial relationships within the Mamba framework. The use of multiple scanning directions and bi-directional approaches can result in significant redundancy and increased computational demands, reducing Mamba's linear-complexity advantage; efficient computation strategies are necessary to improve performance without excessive resource consumption. Originally designed for causal sequential data, Mamba's selective scanning struggles with non-causal visual data, indicating a need for further refinement to adapt Mamba for visual data processing. Gradient vanishing and exploding are persistent challenges in deep learning, which are exacerbated as datasets grow larger. The Mamba architecture also encounters stability issues that need addressing to bolster its robustness and reliability.
Achieving trustworthiness is an ongoing concern. Understanding Mamba’s effectiveness in visual tasks poses a significant challenge, requiring a deeper theoretical and empirical grasp of its mechanisms compared to models such as RNNs, CNNs, and ViTs. Improved interpretability will enable more effective application and optimization across diverse visual tasks. Mamba’s hidden states tend to accumulate domain-specific information, which can hinder generalization. Its reliance on 1D scanning strategies may introduce biases specific to certain domains, and current techniques often fall short in ensuring domain-agnostic processing. Enhancing Mamba’s ability to generalize and its robustness, especially in adversarial contexts, presents a critical challenge.

5.2. Future Directions

Mamba represents an exciting and emerging direction with numerous avenues for exploration. Here, we highlight several promising directions:
Innovative Scanning Mechanisms: To harness the full potential of visual Mamba, new scanning schemes are needed. These schemes should effectively address the non-causal nature of visual data and capture spatial information across multiple dimensions. Developing more sophisticated scanning mechanisms will be crucial for improving Mamba’s performance in visual tasks.
Hybrid Architectures: Combining Mamba with other architectures like transformers could mitigate some of its inherent limitations. Hybrid models that integrate Mamba with self-attention mechanisms or CNNs may leverage the strengths of each approach. However, careful design is necessary to ensure that the sequence modeling approaches of the combined components do not conflict and that the hybrid model can effectively capture comprehensive and detailed information.
Large-Scale Models and Data Efficiency: As large models become the norm, scaling Mamba while maintaining its computational efficiency is essential. Developing large-scale Mamba models that retain their advantages in sequence modeling could lead to powerful visual foundation models. Additionally, improving data efficiency and enabling optimal performance without reliance on extensive datasets will broaden Mamba’s applicability in various tasks.
Integration with Other Methodologies: Mamba can be integrated with other methodologies, such as multi-modal information processing, diffusion models, domain generalization, and visual-language models. Exploring how Mamba can synergize with these methods will expand its utility and effectiveness in complex tasks across multiple domains.
Computation Efficiency: Enhancing the computational efficiency of Mamba models, especially for vision tasks, is a promising research direction. Developing hardware-aware algorithms tailored for visual Mamba models can reduce computational overheads, while maintaining or improving performance, making them more practical for real-world applications.

Author Contributions

Conceptualization, H.Z.; Writing—original draft preparation, H.Z., Y.Z., D.W. and Z.Y.; Visualization, L.Z. and T.C.; Writing—review and editing, Z.W. and Z.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Rosenblatt, F. The Perceptron, a Perceiving and Recognizing Automaton Project Para; Cornell Aeronautical Laboratory: Buffalo, NY, USA, 1957. [Google Scholar]
  2. Rosenblatt, F.; Jones, B.; Smith, T.; Brown, C.; Green, M.; Wilson, A.; Taylor, J.; White, P.; King, R.; Johnson, L. Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms; Spartan Books: Washington, DC, USA, 1962; Volume 55. [Google Scholar]
  3. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  4. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 84–90. [Google Scholar] [CrossRef]
  5. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  6. Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473. [Google Scholar]
  7. Parikh, A.P.; Täckström, O.; Das, D.; Uszkoreit, J. A decomposable attention model for natural language inference. arXiv 2016, arXiv:1606.01933. [Google Scholar]
  8. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  9. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  10. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
  11. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar]
  12. Lieber, O.; Lenz, B.; Bata, H.; Cohen, G.; Osin, J.; Dalmedigos, I.; Safahi, E.; Meirom, S.; Belinkov, Y.; Shalev-Shwartz, S.; et al. Jamba: A Hybrid Transformer-Mamba Language Model. arXiv 2024, arXiv:2403.19887. [Google Scholar]
  13. Pióro, M.; Ciebiera, K.; Król, K.; Ludziejewski, J.; Jaszczur, S. Moe-mamba: Efficient selective state space models with mixture of experts. arXiv 2024, arXiv:2401.04081. [Google Scholar]
  14. Anthony, Q.; Tokpanov, Y.; Glorioso, P.; Millidge, B. BlackMamba: Mixture of Experts for State-Space Models. arXiv 2024, arXiv:2402.01771. [Google Scholar]
  15. Fu, D.Y.; Dao, T.; Saab, K.K.; Thomas, A.W.; Rudra, A.; Ré, C. Hungry hungry hippos: Towards language modeling with state space models. arXiv 2022, arXiv:2212.14052. [Google Scholar]
  16. Hendrycks, D.; Gimpel, K. Gaussian error linear units (gelus). arXiv 2016, arXiv:1606.08415. [Google Scholar]
  17. Ramachandran, P.; Zoph, B.; Le, Q.V. Swish: A Self-Gated Activation Function. arXiv 2017, arXiv:1710.05941. [Google Scholar]
  18. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar]
  19. Sun, Y.; Dong, L.; Huang, S.; Ma, S.; Xia, Y.; Xue, J.; Wang, J.; Wei, F. Retentive network: A Successor to Transformer for Large Language Models. arXiv 2023, arXiv:2307.08621. [Google Scholar]
  20. Katharopoulos, A.; Vyas, A.; Pappas, N.; Fleuret, F. Transformers are rnns: Fast autoregressive transformers with linear attention. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual Event, 13–18 July 2020; pp. 5156–5165. [Google Scholar]
  21. Poli, M.; Massaroli, S.; Nguyen, E.; Fu, D.Y.; Dao, T.; Baccus, S.; Bengio, Y.; Ermon, S.; Ré, C. Hyena hierarchy: Towards larger convolutional language models. In Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 28043–28078. [Google Scholar]
  22. Romero, D.W.; Kuzina, A.; Bekkers, E.J.; Tomczak, J.M.; Hoogendoorn, M. Ckconv: Continuous kernel convolution for sequential data. arXiv 2021, arXiv:2102.02611. [Google Scholar]
  23. Zhai, S.; Talbott, W.; Srivastava, N.; Huang, C.; Goh, H.; Zhang, R.; Susskind, J. An attention free transformer. arXiv 2021, arXiv:2105.14103. [Google Scholar]
  24. Peng, B.; Alcaide, E.; Anthony, Q.; Albalak, A.; Arcadinho, S.; Cao, H.; Cheng, X.; Chung, M.; Grella, M.; GV, K.K.; et al. Rwkv: Reinventing rnns for the transformer era. arXiv 2023, arXiv:2305.13048. [Google Scholar]
  25. Tallec, C.; Ollivier, Y. Can recurrent neural networks warp time? arXiv 2018, arXiv:1804.11188. [Google Scholar]
  26. Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision mamba: Efficient visual representation learning with bidirectional state space model. arXiv 2024, arXiv:2401.09417. [Google Scholar]
  27. Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Liu, Y. Vmamba: Visual state space model. arXiv 2024, arXiv:2401.10166. [Google Scholar]
  28. Yang, C.; Chen, Z.; Espinosa, M.; Ericsson, L.; Wang, Z.; Liu, J.; Crowley, E.J. PlainMamba: Improving Non-Hierarchical Mamba in Visual Recognition. arXiv 2024, arXiv:2403.17695. [Google Scholar]
  29. Huang, T.; Pei, X.; You, S.; Wang, F.; Qian, C.; Xu, C. LocalMamba: Visual State Space Model with Windowed Selective Scan. arXiv 2024, arXiv:2403.09338. [Google Scholar]
  30. Pei, X.; Huang, T.; Xu, C. EfficientVMamba: Atrous Selective Scan for Light Weight Visual Mamba. arXiv 2024, arXiv:2403.09977. [Google Scholar]
  31. Hu, V.T.; Baumann, S.A.; Gui, M.; Grebenkova, O.; Ma, P.; Fischer, J.; Ommer, B. Zigma: Zigzag mamba diffusion model. arXiv 2024, arXiv:2403.13802. [Google Scholar]
  32. Shi, Y.; Xia, B.; Jin, X.; Wang, X.; Zhao, T.; Xia, X.; Xiao, X.; Yang, W. VmambaIR: Visual State Space Model for Image Restoration. arXiv 2024, arXiv:2403.11423. [Google Scholar]
  33. Li, K.; Li, X.; Wang, Y.; He, Y.; Wang, Y.; Wang, L.; Qiao, Y. Videomamba: State space model for efficient video understanding. arXiv 2024, arXiv:2403.06977. [Google Scholar]
  34. Zhang, Z.; Liu, A.; Reid, I.; Hartley, R.; Zhuang, B.; Tang, H. Motion mamba: Efficient and long sequence motion generation with hierarchical and bidirectional selective ssm. arXiv 2024, arXiv:2403.07487. [Google Scholar]
  35. Yang, Y.; Xing, Z.; Zhu, L. Vivim: A video vision mamba for medical video object segmentation. arXiv 2024, arXiv:2401.14168. [Google Scholar]
  36. Chen, K.; Chen, B.; Liu, C.; Li, W.; Zou, Z.; Shi, Z. Rsmamba: Remote sensing image classification with state space model. arXiv 2024, arXiv:2403.19654. [Google Scholar] [CrossRef]
  37. Behrouz, A.; Santacatterina, M.; Zabih, R. MambaMixer: Efficient Selective State Space Models with Dual Token and Channel Selection. arXiv 2024, arXiv:2403.19888. [Google Scholar]
  38. Li, S.; Singh, H.; Grover, A. Mamba-ND: Selective State Space Modeling for Multi-Dimensional Data. arXiv 2024, arXiv:2402.05892. [Google Scholar]
  39. Patro, B.N.; Agneeswaran, V.S. SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time series. arXiv 2024, arXiv:2403.15360. [Google Scholar]
  40. Li, S.; Zhu, T.; Duan, F.; Chen, L.; Ning, H.; Wan, Y. HARMamba: Efficient Wearable Sensor Human Activity Recognition Based on Bidirectional Selective SSM. arXiv 2024, arXiv:2403.20183. [Google Scholar]
  41. Cheng, C.; Wang, H.; Sun, H. Activating Wider Areas in Image Super-Resolution. arXiv 2024, arXiv:2403.08330. [Google Scholar]
  42. Qiao, Y.; Yu, Z.; Guo, L.; Chen, S.; Zhao, Z.; Sun, M.; Wu, Q.; Liu, J. VL-Mamba: Exploring State Space Models for Multimodal Learning. arXiv 2024, arXiv:2403.13600. [Google Scholar]
  43. Chen, G.; Huang, Y.; Xu, J.; Pei, B.; Chen, Z.; Li, Z.; Wang, J.; Li, K.; Lu, T.; Wang, L. Video mamba suite: State space model as a versatile alternative for video understanding. arXiv 2024, arXiv:2403.09626. [Google Scholar]
  44. Liu, J.; Yu, R.; Wang, Y.; Zheng, Y.; Deng, T.; Ye, W.; Wang, H. Point mamba: A novel point cloud backbone based on state space model with octree-based ordering strategy. arXiv 2024, arXiv:2403.06467. [Google Scholar]
  45. Wang, J.; Chen, J.; Chen, D.; Wu, J. Large Window-based Mamba UNet for Medical Image Segmentation: Beyond Convolution and Self-attention. arXiv 2024, arXiv:2403.07332. [Google Scholar]
  46. Zhang, Y.; Yan, W.; Yan, K.; Lam, C.P.; Qiu, Y.; Zheng, P.; Tang, R.S.Y.; Cheng, S.S. Motion-Guided Dual-Camera Tracker for Low-Cost Skill Evaluation of Gastric Endoscopy. arXiv 2024, arXiv:2403.05146. [Google Scholar]
  47. Tang, Y.; Dong, P.; Tang, Z.; Chu, X.; Liang, J. VMRNN: Integrating Vision Mamba and LSTM for Efficient and Accurate Spatiotemporal Forecasting. arXiv 2024, arXiv:2403.16536. [Google Scholar]
  48. Chen, C.S.; Chen, G.Y.; Zhou, D.; Jiang, D.; Chen, D.S. Res-VMamba: Fine-Grained Food Category Visual Classification Using Selective State Space Models with Deep Residual Learning. arXiv 2024, arXiv:2402.15761. [Google Scholar]
  49. Wan, Z.; Wang, Y.; Yong, S.; Zhang, P.; Stepputtis, S.; Sycara, K.; Xie, Y. Sigma: Siamese Mamba Network for Multi-Modal Semantic Segmentation. arXiv 2024, arXiv:2404.04256. [Google Scholar]
  50. Yang, Y.; Ma, C.; Yao, J.; Zhong, Z.; Zhang, Y.; Wang, Y. ReMamber: Referring Image Segmentation with Mamba Twister. arXiv 2024, arXiv:2403.17839. [Google Scholar]
  51. Wang, Z.; Zheng, J.Q.; Zhang, Y.; Cui, G.; Li, L. Mamba-unet: Unet-like pure visual mamba for medical image segmentation. arXiv 2024, arXiv:2402.05079. [Google Scholar]
  52. Ma, C.; Wang, Z. Semi-Mamba-UNet: Pixel-Level Contrastive and Pixel-Level Cross-Supervised Visual Mamba-based UNet for Semi-Supervised Medical Image Segmentation. arXiv 2024, arXiv:2402.07245. [Google Scholar]
  53. Wang, Z.; Zheng, J.Q.; Ma, C.; Guo, T. VMambaMorph: A Visual Mamba-based Framework with Cross-Scan Module for Deformable 3D Image Registration. arXiv 2024, arXiv:2404.05105. [Google Scholar]
  54. Chen, H.; Song, J.; Han, C.; Xia, J.; Yokoya, N. ChangeMamba: Remote Sensing Change Detection with Spatio-Temporal State Space Model. arXiv 2024, arXiv:2404.03425. [Google Scholar] [CrossRef]
  55. Wu, R.; Liu, Y.; Liang, P.; Chang, Q. H-vmunet: High-order Vision Mamba UNet for Medical Image Segmentation. arXiv 2024, arXiv:2403.13642. [Google Scholar]
  56. Huang, J.; Yang, L.; Wang, F.; Wu, Y.; Nan, Y.; Aviles-Rivero, A.I.; Schönlieb, C.B.; Zhang, D.; Yang, G. MambaMIR: An Arbitrary-Masked Mamba for Joint Medical Image Reconstruction and Uncertainty Estimation. arXiv 2024, arXiv:2402.18451. [Google Scholar]
  57. Guo, H.; Li, J.; Dai, T.; Ouyang, Z.; Ren, X.; Xia, S.T. MambaIR: A Simple Baseline for Image Restoration with State-Space Model. arXiv 2024, arXiv:2402.15648. [Google Scholar]
  58. Shahab Sepehri, M.; Fabian, Z.; Soltanolkotabi, M. Serpent: Scalable and Efficient Image Restoration via Multi-scale Structured State Space Models. arXiv 2024, arXiv:2403.17902. [Google Scholar]
  59. Sanjid, K.S.; Hossain, M.T.; Junayed, M.S.S.; Uddin, D.M.M. Integrating Mamba Sequence Model and Hierarchical Upsampling Network for Accurate Semantic Segmentation of Multiple Sclerosis Legion. arXiv 2024, arXiv:2403.17432. [Google Scholar]
  60. Tang, H.; Cheng, L.; Huang, G.; Tan, Z.; Lu, J.; Wu, K. Rotate to Scan: UNet-like Mamba with Triplet SSM Module for Medical Image Segmentation. arXiv 2024, arXiv:2403.17701. [Google Scholar]
  61. Liu, J.; Yang, H.; Zhou, H.Y.; Xi, Y.; Yu, L.; Yu, Y.; Liang, Y.; Shi, G.; Zhang, S.; Zheng, H.; et al. Swin-umamba: Mamba-based unet with imagenet-based pretraining. arXiv 2024, arXiv:2402.03302. [Google Scholar]
  62. Wu, R.; Liu, Y.; Liang, P.; Chang, Q. UltraLight VM-UNet: Parallel Vision Mamba Significantly Reduces Parameters for Skin Lesion Segmentation. arXiv 2024, arXiv:2403.20035. [Google Scholar]
  63. Ruan, J.; Xiang, S. Vm-unet: Vision mamba unet for medical image segmentation. arXiv 2024, arXiv:2402.02491. [Google Scholar]
  64. Zhang, M.; Yu, Y.; Gu, L.; Lin, T.; Tao, X. VM-UNET-V2 Rethinking Vision Mamba UNet for Medical Image Segmentation. arXiv 2024, arXiv:2403.09157. [Google Scholar]
  65. Yue, Y.; Li, Z. MedMamba: Vision Mamba for Medical Image Classification. arXiv 2024, arXiv:2403.03849. [Google Scholar]
  66. Chen, T.; Tan, Z.; Gong, T.; Chu, Q.; Wu, Y.; Liu, B.; Ye, J.; Yu, N. MiM-ISTD: Mamba-in-Mamba for Efficient Infrared Small Target Detection. arXiv 2024, arXiv:2403.02148. [Google Scholar]
  67. Ma, X.; Zhang, X.; Pun, M.O. RS3Mamba: Visual State Space Model for Remote Sensing Images Semantic Segmentation. arXiv 2024, arXiv:2404.02457. [Google Scholar] [CrossRef]
  68. Zhen, Z.; Hu, Y.; Feng, Z. FreqMamba: Viewing Mamba from a Frequency Perspective for Image Deraining. arXiv 2024, arXiv:2404.09476. [Google Scholar]
  69. Zhao, S.; Chen, H.; Zhang, X.; Xiao, P.; Bai, L.; Ouyang, W. RS-Mamba for Large Remote Sensing Image Dense Prediction. arXiv 2024, arXiv:2404.02668. [Google Scholar]
  70. Gong, H.; Kang, L.; Wang, Y.; Wan, X.; Li, H. nnmamba: 3d biomedical image segmentation, classification and landmark detection with state space model. arXiv 2024, arXiv:2402.03526. [Google Scholar]
  71. Xing, Z.; Ye, T.; Yang, Y.; Liu, G.; Zhu, L. Segmamba: Long-range sequential modeling mamba for 3d medical image segmentation. arXiv 2024, arXiv:2401.13560. [Google Scholar]
  72. Yang, J.X.; Zhou, J.; Wang, J.; Tian, H.; Liew, A.W.C. HSIMamba: Hyperspectral imaging efficient feature learning with bidirectional state space for classification. arXiv 2024, arXiv:2404.00272. [Google Scholar]
  73. He, X.; Cao, K.; Yan, K.; Li, R.; Xie, C.; Zhang, J.; Zhou, M. Pan-Mamba: Effective pan-sharpening with State Space Model. arXiv 2024, arXiv:2402.12192. [Google Scholar]
  74. Xu, Z.; Lin, Y.; Han, H.; Yang, S.; Li, R.; Zhang, Y.; Li, X. MambaTalk: Efficient Holistic Gesture Synthesis with Selective State Space Models. arXiv 2024, arXiv:2403.09471. [Google Scholar]
  75. Zhu, Q.; Cai, Y.; Fang, Y.; Yang, Y.; Chen, C.; Fan, L.; Nguyen, A. Samba: Semantic Segmentation of Remotely Sensed Images with State Space Model. arXiv 2024, arXiv:2404.01705. [Google Scholar]
  76. Wang, Z.; Ma, C. Weak-Mamba-UNet: Visual Mamba Makes CNN and ViT Work Better for Scribble-based Medical Image Segmentation. arXiv 2024, arXiv:2402.10887. [Google Scholar]
  77. Hao, J.; He, L.; Hung, K.F. T-Mamba: Frequency-Enhanced Gated Long-Range Dependency for Tooth 3D CBCT Segmentation. arXiv 2024, arXiv:2404.01065. [Google Scholar]
  78. Xie, J.; Liao, R.; Zhang, Z.; Yi, S.; Zhu, Y.; Luo, G. ProMamba: Prompt-Mamba for polyp segmentation. arXiv 2024, arXiv:2403.13660. [Google Scholar]
  79. Zubić, N.; Gehrig, M.; Scaramuzza, D. State Space Models for Event Cameras. arXiv 2024, arXiv:2402.15584. [Google Scholar]
  80. Islam, M.M.; Bertasius, G. Long movie clip classification with state-space video models. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 87–104. [Google Scholar]
  81. Zheng, Z.; Zhang, J. FD-Vision Mamba for Endoscopic Exposure Correction. arXiv 2024, arXiv:2402.06378. [Google Scholar]
  82. Yang, G.; Du, K.; Yang, Z.; Du, Y.; Zheng, Y.; Wang, S. CMViM: Contrastive Masked Vim Autoencoder for 3D Multi-modal Representation Learning for AD classification. arXiv 2024, arXiv:2403.16520. [Google Scholar]
  83. Li, Y.; Yang, W.; Fei, B. 3DMambaComplete: Exploring Structured State Space Model for Point Cloud Completion. arXiv 2024, arXiv:2404.07106. [Google Scholar]
  84. Ma, J.; Li, F.; Wang, B. U-mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv 2024, arXiv:2401.04722. [Google Scholar]
  85. Zheng, Z.; Wu, C. U-shaped Vision Mamba for Single Image Dehazing. arXiv 2024, arXiv:2402.04139. [Google Scholar]
86. Liao, W.; Zhu, Y.; Wang, X.; Pan, C.; Wang, Y.; Ma, L. Lightm-unet: Mamba assists in lightweight unet for medical image segmentation. arXiv 2024, arXiv:2403.05246.
87. Guo, T.; Wang, Y.; Meng, C. Mambamorph: A mamba-based backbone with contrastive feature learning for deformable mr-ct registration. arXiv 2024, arXiv:2401.13934.
88. Fei, Z.; Fan, M.; Yu, C.; Huang, J. Scalable Diffusion Models with State Space Backbone. arXiv 2024, arXiv:2402.05608.
89. Oshima, Y.; Taniguchi, S.; Suzuki, M.; Matsuo, Y. SSM Meets Video Diffusion Models: Efficient Video Generation with Structured State Spaces. arXiv 2024, arXiv:2403.07711.
90. Fu, L.; Li, X.; Cai, X.; Wang, Y.; Wang, X.; Shen, Y.; Yao, Y. MD-Dose: A Diffusion Model based on the Mamba for Radiotherapy Dose Prediction. arXiv 2024, arXiv:2403.08479.
91. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022.
92. Shi, X.; Chen, Z.; Wang, H.; Yeung, D.Y.; Wong, W.K.; Woo, W.C. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. Adv. Neural Inf. Process. Syst. 2015, 28, 802–810.
93. Li, W.; Hong, X.; Fan, X. SpikeMba: Multi-Modal Spiking Saliency Mamba for Temporal Video Grounding. arXiv 2024, arXiv:2404.01174.
94. Radosavovic, I.; Kosaraju, R.P.; Girshick, R.; He, K.; Dollár, P. Designing network design spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10428–10436.
95. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
96. Zhang, P.; Dai, X.; Yang, J.; Xiao, B.; Yuan, L.; Zhang, L.; Gao, J. Multi-scale vision longformer: A new vision transformer for high-resolution image encoding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 2998–3008.
97. Yang, J.; Li, C.; Zhang, P.; Dai, X.; Xiao, B.; Yuan, L.; Gao, J. Focal self-attention for local-global interactions in vision transformers. arXiv 2021, arXiv:2107.00641.
98. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
99. He, T.; Zhang, Z.; Zhang, H.; Zhang, Z.; Xie, J.; Li, M. Bag of tricks for image classification with convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 558–567.
100. Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1492–1500.
101. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 10347–10357.
102. Chen, Z.; Duan, Y.; Wang, W.; He, J.; Lu, T.; Dai, J.; Qiao, Y. Vision transformer adapter for dense predictions. arXiv 2022, arXiv:2205.08534.
103. Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 568–578.
104. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986.
105. Touvron, H.; Cord, M.; Jégou, H. DeiT III: Revenge of the ViT. In Computer Vision—ECCV 2022: 17th European Conference, Tel Aviv, Israel, 23–27 October 2022, Proceedings, Part XXIV; Springer: Cham, Switzerland, 2022; pp. 516–533.
106. Zhao, H.; Zhang, M.; Zhao, W.; Ding, P.; Huang, S.; Wang, D. Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference. arXiv 2024, arXiv:2403.14520.
107. Gao, H.; Dang, D. Aggregating Local and Global Features via Selective State Spaces Model for Efficient Image Deblurring. arXiv 2024, arXiv:2403.20106.
108. Zhou, Q.; Yang, W.; Fei, B.; Xu, J.; Zhang, R.; Liu, K.; Luo, Y.; He, Y. 3DMambaIPF: A State Space Model for Iterative Point Cloud Filtering via Differentiable Rendering. arXiv 2024, arXiv:2404.05522.
109. Zhang, T.; Li, X.; Yuan, H.; Ji, S.; Yan, S. Point Cloud Mamba: Point Cloud Learning via State Space Model. arXiv 2024, arXiv:2403.00762.
110. Liang, D.; Zhou, X.; Wang, X.; Zhu, X.; Xu, W.; Zou, Z.; Ye, X.; Bai, X. PointMamba: A Simple State Space Model for Point Cloud Analysis. arXiv 2024, arXiv:2402.10739.
111. Shen, Q.; Yi, X.; Wu, Z.; Zhou, P.; Zhang, H.; Yan, S.; Wang, X. Gamba: Marry Gaussian Splatting with Mamba for single view 3D reconstruction. arXiv 2024, arXiv:2403.18795.
112. Seeram, E. Digital Radiography: Physical Principles and Quality Control; Springer: Singapore, 2019.
113. Lui, R.N.; Wong, S.H.; Sánchez-Luna, S.A.; Pellino, G.; Bollipo, S.; Wong, M.Y.; Chiu, P.W.; Sung, J.J. Overview of guidance for endoscopy during the coronavirus disease 2019 pandemic. J. Gastroenterol. Hepatol. 2020, 35, 749–759.
114. Withers, P.J.; Bouman, C.; Carmignato, S.; Cnudde, V.; Grimaldi, D.; Hagen, C.K.; Maire, E.; Manley, M.; Du Plessis, A.; Stock, S.R. X-ray computed tomography. Nat. Rev. Methods Prim. 2021, 1, 18.
115. Christensen-Jeffries, K.; Couture, O.; Dayton, P.A.; Eldar, Y.C.; Hynynen, K.; Kiessling, F.; O’Reilly, M.; Pinton, G.F.; Schmitz, G.; Tang, M.X.; et al. Super-resolution ultrasound imaging. Ultrasound Med. Biol. 2020, 46, 865–891.
116. Tiwari, A.; Srivastava, S.; Pant, M. Brain tumor segmentation and classification from magnetic resonance images: Review of selected methods from 2014 to 2019. Pattern Recognit. Lett. 2020, 131, 244–260.
117. Ye, Z.; Chen, T. P-Mamba: Marrying Perona Malik Diffusion with Mamba for Efficient Pediatric Echocardiographic Left Ventricular Segmentation. arXiv 2024, arXiv:2402.08506.
118. Yang, S.; Wang, Y.; Chen, H. MambaMIL: Enhancing Long Sequence Modeling with Sequence Reordering in Computational Pathology. arXiv 2024, arXiv:2403.06800.
Figure 1. The number of SSM-based and Mamba-based papers published from 2021 to March 2024. (a) SSM-based papers; (b) Mamba-based papers on vision.
Figure 2. Graphical representation of the Mamba block [11].
Figure 3. Graphical representation of a discretized SSM.
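For readers skimming the figure list, the operation Figure 3 depicts is the standard zero-order-hold (ZOH) discretization used by S4 and Mamba. The notation below follows the common S4/Mamba convention (A, B, C are the state matrices and Δ the step size) and may differ slightly from the symbols used earlier in this survey:

```latex
% Continuous-time SSM and its zero-order-hold (ZOH) discretization with step size \Delta
\begin{aligned}
  h'(t) &= A\,h(t) + B\,x(t), &\qquad y(t) &= C\,h(t),\\
  \bar{A} &= \exp(\Delta A), &\qquad \bar{B} &= (\Delta A)^{-1}\bigl(\exp(\Delta A) - I\bigr)\,\Delta B,\\
  h_t &= \bar{A}\,h_{t-1} + \bar{B}\,x_t, &\qquad y_t &= C\,h_t.
\end{aligned}
```

Mamba's selection mechanism additionally makes Δ, B, and C functions of the input token, which is what turns this fixed linear recurrence into the selective scan.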
Figure 4. Comparison between different 2D scans and the selective scan orders in Vim (a) [26], VMamba (b) [27], PlainMamba (c) [28], LocalMamba (d) [29], Efficient VMamba (e) [30], Zigzag (f) [31], VmambaIR (g) [32], VideoMamba (h) [33], Motion Mamba (i) [34], Vivim (j) [35], and RSMamba (k) [36].
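To make the scan orders in Figure 4 concrete, the snippet below sketches the kind of four-way flattening (row-major, column-major, and their reverses) used by VMamba-style cross-scans. It is a minimal illustration only: the function name cross_scan, the tensor shapes, and the stacking convention are assumptions and do not reproduce any particular implementation cited in the figure.

```python
# Illustrative four-way cross-scan flattening of a 2D feature map.
import torch


def cross_scan(feat: torch.Tensor) -> torch.Tensor:
    """feat: (batch, channels, H, W) -> (batch, 4, channels, H*W)."""
    row_major = feat.flatten(2)                         # rows scanned left-to-right, top-to-bottom
    col_major = feat.transpose(2, 3).flatten(2)         # columns scanned top-to-bottom, left-to-right
    scans = torch.stack([row_major, col_major], dim=1)  # (batch, 2, channels, H*W)
    return torch.cat([scans, scans.flip(-1)], dim=1)    # append the two reversed orders


if __name__ == "__main__":
    x = torch.randn(1, 8, 14, 14)
    print(cross_scan(x).shape)                          # torch.Size([1, 4, 8, 196])
```

Each flattened sequence is typically processed by its own 1D selective scan, and the four outputs are merged back onto the 2D grid; the methods in Figure 4 mainly differ in which traversal orders they choose.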
Figure 5. Graphical representation of the architecture and element functions of the ViM block and the VSS block. (a) ViM block; (b) VSS block.
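As a companion to Figure 5a, the following is a simplified PyTorch-style sketch of a bidirectional, ViM-style block: a gated branch, depthwise causal convolutions, and forward/backward selective scans. The class names, dimensions, and the naive Python-loop recurrence are illustrative assumptions; the published models rely on fused, hardware-aware scan kernels rather than an explicit loop.

```python
# Simplified, illustrative sketch of a ViM-style bidirectional selective-SSM block.
import torch
import torch.nn as nn
import torch.nn.functional as F


class NaiveSelectiveSSM(nn.Module):
    """Input-dependent (selective) SSM evaluated with a naive sequential scan."""

    def __init__(self, dim: int, state_dim: int = 16):
        super().__init__()
        self.A_log = nn.Parameter(torch.randn(dim, state_dim))  # continuous A, log-parameterized
        self.proj_B = nn.Linear(dim, state_dim)                 # B depends on the input token
        self.proj_C = nn.Linear(dim, state_dim)                 # C depends on the input token
        self.proj_delta = nn.Linear(dim, dim)                   # per-channel step size Delta
        self.D = nn.Parameter(torch.ones(dim))                  # residual/skip term

    def forward(self, x):                                       # x: (batch, length, dim)
        A = -torch.exp(self.A_log)                              # keep the real part negative
        B, C = self.proj_B(x), self.proj_C(x)                   # (batch, length, state_dim)
        delta = F.softplus(self.proj_delta(x))                  # (batch, length, dim)
        h = x.new_zeros(x.shape[0], x.shape[2], self.A_log.shape[1])
        ys = []
        for t in range(x.shape[1]):                             # sequential scan over tokens
            dA = torch.exp(delta[:, t].unsqueeze(-1) * A)       # ZOH-style discretization
            dB = delta[:, t].unsqueeze(-1) * B[:, t].unsqueeze(1)
            h = dA * h + dB * x[:, t].unsqueeze(-1)
            ys.append((h * C[:, t].unsqueeze(1)).sum(-1) + self.D * x[:, t])
        return torch.stack(ys, dim=1)                           # (batch, length, dim)


class VimStyleBlock(nn.Module):
    """Forward + backward selective scans combined under a SiLU gate, with residual."""

    def __init__(self, dim: int, expand: int = 2):
        super().__init__()
        inner = expand * dim
        self.norm = nn.LayerNorm(dim)
        self.in_proj = nn.Linear(dim, 2 * inner)                # split into x-branch and gate z
        self.conv_fwd = nn.Conv1d(inner, inner, 4, padding=3, groups=inner)
        self.conv_bwd = nn.Conv1d(inner, inner, 4, padding=3, groups=inner)
        self.ssm_fwd = NaiveSelectiveSSM(inner)
        self.ssm_bwd = NaiveSelectiveSSM(inner)
        self.out_proj = nn.Linear(inner, dim)

    def _branch(self, x, conv, ssm, reverse=False):
        if reverse:
            x = x.flip(1)                                       # scan the sequence back-to-front
        x = conv(x.transpose(1, 2))[..., : x.shape[1]].transpose(1, 2)  # causal depthwise conv
        x = ssm(F.silu(x))
        return x.flip(1) if reverse else x

    def forward(self, tokens):                                  # tokens: (batch, length, dim)
        x, z = self.in_proj(self.norm(tokens)).chunk(2, dim=-1)
        y = (self._branch(x, self.conv_fwd, self.ssm_fwd)
             + self._branch(x, self.conv_bwd, self.ssm_bwd, reverse=True))
        return tokens + self.out_proj(y * F.silu(z))            # gate, project, residual


if __name__ == "__main__":
    block = VimStyleBlock(dim=64)
    print(block(torch.randn(2, 196, 64)).shape)                 # torch.Size([2, 196, 64])
```

Roughly speaking, the VSS block in Figure 5b follows the same pattern but replaces the two 1D scans with the four-directional 2D cross-scan sketched after Figure 4.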
Figure 6. An overview of Mamba models used for segmentation tasks in 2D medical images.
Table 7. Representative works of general visual Mamba.

| Category | Sub-Category | Method | Efficiency |
|---|---|---|---|
| Backbone | Visual Mamba | Vision Mamba [26] | Params: Vim-Ti 7, Vim-S 26 |
| | | VMamba [27] | FLOPs: Base 15.2, Small 9.1, Tiny 4.5 |
| | | PlainMamba [28] | FLOPs: PlainMamba-L1 3.0, PlainMamba-L2 8.1, PlainMamba-L3 14.4 |
| | | LocalMamba [29] | FLOPs: LocalVMamba-T 5.7, LocalVMamba-S 11.4 |
| | | Mamba-ND [38] | Params: Mamba-2D 24, Mamba-3D 36 |
| | | SiMBA [39] | – |
| | | RES-VMAMBA [48] | – |
| | Efficient Mamba | EfficientVMamba [30] | – |
| | | MambaMixer [37] | – |
| High/Mid-level vision | Object detection | SSM-ViT [79] | Params 17.5 |
| | Segmentation | ReMamber [50] | – |
| | | Sigma [49] | – |
| | Video classification | ViS4mer [80] | Memory 5273.6 |
| | Video understanding | Video Mamba Suite [43] | – |
| | | VideoMamba [33] | FLOPs: VideoMamba-Ti 7.1, VideoMamba-S 28, VideoMamba-M 83.1 |
| | | SpikeMba [93] | – |
| | Multi-Modal understanding | Cobra [106] | – |
| | | ReMamber [50] | – |
| | | VL-Mamba [42] | – |
| | Video prediction | VMRNN [47] | Params 2.6, FLOPs 0.9 |
| | | HARMamba [40] | FLOPs: PAMAP2 279.21, UCI 237.83, UNIMIB HAR 238.36, WISDM 256.52 |
| Low-level vision | Image super-resolution | MMA [41] | – |
| | Image restoration | MambaIR [57] | Params 16.7 |
| | | SERPENT [58] | – |
| | | VmambaIR [32] | Params 10.50, FLOPs 20.5 |
| | Image dehazing | UVM-Net [85] | Params 19.25 |
| | Image deraining | FreqMamba [68] | Params 14.52 |
| | Image deblurring | ALGNet [107] | FLOPs 17 |
| | Visual generation | MambaTalk [74] | – |
| | | Motion Mamba [34] | – |
| | | DiS [88] | – |
| | | ZigMa [31] | – |
| | Point cloud | 3DMambaComplete [83] | Params 34.06, FLOPs 7.12 |
| | | 3DMambaIPF [108] | – |
| | | Point Cloud Mamba [109] | Params 34.2, FLOPs 45.0 |
| | | POINT MAMBA [44] | Memory 8550 |
| | | SSPointMamba [110] | Params 12.3, FLOPs 3.6 |
| | 3D reconstruction | GAMBA [111] | – |
| | Video generation | SSM-based diffusion model [89] | – |

For efficiency, inference speed is in ms, memory in MB, Params in M, and FLOPs in G.
Table 8. Representative works of medical visual Mamba.

| Category | Sub-Category | Method | Efficiency |
|---|---|---|---|
| 2D | Segmentation | Mamba-UNet [51] | – |
| | | H-vmunet [55] | Memory 0.676, Params 8.97 |
| | | Mamba-HUNet [59] | – |
| | | P-Mamba [117] | Inference speed 23.49, Memory 12.22, Params 183.37, FLOPs 71.81 × 10^9 |
| | | ProMamba [78] | Params 102 |
| | | TM-UNet [60] | Params 14.86, Total Params 8.41, FLOPs 3.42 |
| | | Semi-Mamba-UNet [52] | – |
| | | Swin-UMamba [61] | Params 28, FLOPs 18.9 |
| | | UltraLight VM-UNet [62] | Params 0.049, GFLOPs 0.060 |
| | | U-Mamba [84] | – |
| | | VM-UNet [63] | Params 34.62, FLOPs 7.56, FPS 20.612 |
| | | VM-UNET-V2 [64] | Params 17.91, FLOPs 4.40, FPS 32.58 |
| | | Weak-Mamba-UNet [76] | – |
| | Radiation dose prediction | MD-Dose [90] | Inference speed 18, Params 30.47 |
| | Classification | MedMamba [65] | – |
| | | MambaMIL [118] | – |
| | Image reconstruction | MambaMIR/MambaMIR-GAN [56] | – |
| | Exposure correction | FDVM-Net [81] | Inference speed 22.95 |
| 3D | Segmentation | LMa-UNet [45] | – |
| | | LightM-UNet [86] | Params 1.87, FLOPs 457.62 × 10^9 |
| | | SegMamba [71] | Inference speed 151 |
| | | T-Mamba [77] | – |
| | | Vivim [35] | FPS 35.33 |
| | Classification | CMViM [82] | Params 50 |
| | Motion tracking | Motion-Guided Dual-Camera Tracker [46] | – |
| | Backbone | nnMamba [70] | Params 15.55, FLOPs 141.14 |
| | Image registration | VMambaMorph [53] | Inference speed 19, Memory 3.93, Params 9.64 |
| | | MambaMorph [87] | Inference speed 27, Memory 7.60, Params 7.59 |

For efficiency, inference speed is in ms, memory in GB, Params in M, and FLOPs in G.
Table 9. Representative Mamba works on remote sensing images.

| Category | Method | Highlight | Efficiency |
|---|---|---|---|
| Pan-sharpening | Pan-Mamba [73] | Channel-swapping Mamba; cross-modal Mamba | Params 0.1827, FLOPs 3.0088 |
| Infrared small target detection | MIM-ISTD [66] | Mamba-in-Mamba architecture | Params 1.16, FLOPs 1.01, Inference speed 30, Memory 1774 |
| Classification | RSMamba [36] | Multi-path activation | – |
| | HSIMamba [72] | Processes data bidirectionally | Memory 136.53 |
| Image dense prediction | RS-Mamba [69] | Omnidirectional selective scan | – |
| Change detection | ChangeMamba [54] | Cross-scan mechanism | – |
| Semantic segmentation | RS3Mamba [67] | Dual-branch network | FLOPs 31.65, Params 43.32, Memory 2332 |
| | Samba [75] | Encoder–decoder architecture | Params 51.9 |

For efficiency, inference speed is in ms, memory in MB, Params in M, and FLOPs in G.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
