Article

RegMamba: An Improved Mamba for Medical Image Registration

1 School of Computer Science and Engineering, Wuhan Institute of Technology, Wuhan 430073, China
2 Hubei Key Laboratory of Intelligent Robot, Wuhan Institute of Technology, Wuhan 430073, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(16), 3305; https://doi.org/10.3390/electronics13163305
Submission received: 20 July 2024 / Revised: 11 August 2024 / Accepted: 16 August 2024 / Published: 20 August 2024
(This article belongs to the Special Issue Application of Machine Learning in Graphics and Images, 2nd Edition)

Abstract: Deformable medical image registration aims to minimize the differences between fixed and moving images to provide comprehensive physiological or structural information for further medical analysis. Traditional learning-based convolutional network approaches usually suffer from limited receptive fields, and in recent years, the Transformer architecture has gained popularity for its superior long-range relational modeling capabilities, but it still faces severe computational challenges when handling high-resolution medical images. Recently, selective state-space models have shown great potential in the vision domain due to their fast inference and efficient modeling. Inspired by this, we propose RegMamba, a novel medical image registration architecture that combines convolutional and state-space models (SSMs), designed to efficiently capture the complex correspondences in registration while keeping computation efficient. First, our model introduces Mamba to efficiently model long-range dependencies in the data and capture large deformations. At the same time, we add a scaled convolutional layer in Mamba to alleviate the spatial information loss that arises when 3D data are flattened for Mamba's sequential processing. We then propose a deformable convolutional residual module (DCRM) that adaptively adjusts sampling positions to capture more flexible spatial features while learning fine-grained features of different anatomical structures, constructing local correspondences and improving the model's perception. We demonstrate the advanced registration performance of our method on the LPBA40 and IXI public datasets.

1. Introduction

Deformable medical image registration is a technique commonly used to align medical images by estimating non-rigid deformation fields that establish nonlinear spatial correspondences between different images in order to minimize the difference between moving and fixed images. Compared with rigid registration, it can better process the deformation of organs and tissues and provide more accurate image alignment. Such accuracy is essential for the thorough analysis and practical utilization of medical images. For instance, in surgical navigation, deformable registration can help doctors pinpoint the target area, thereby significantly improving the accuracy and safety of surgery. In tumor detection, multi-temporal image registration enables the monitoring of tumor size and location changes to assess treatment effects.
Traditional medical image registration methods [1,2,3,4] usually generate the deformation field by iteratively optimizing an energy function. Specifically, these methods first establish an initial correspondence between image pairs and then continuously adjust the deformation field to minimize the discrepancy between the moving image and the fixed image. Although these methods perform well, the iterative optimization process must be run from scratch for each image pair, which usually consumes considerable computational resources and time, limiting real-time applications. Moreover, such methods are usually sensitive to the noise and artifacts present in medical images, making feature extraction less precise and affecting the final registration.
To address these challenges, learning-based registration methods [5,6,7,8,9,10,11,12] have attracted much attention and sustained research over the past decade. For instance, VoxelMorph [5] is a pioneering deep learning-based medical image registration method designed for unsupervised and weakly supervised learning, which learns complex deformation fields directly from the data via convolutional neural networks without an iterative optimization process. This not only significantly reduces computational cost and time, but also improves the flexibility and robustness of registration, with accuracy comparable to that of traditional methods. Building on this work, a large number of follow-up studies and improvements have emerged, such as RegNet [10], VTN [13], and DIRNet [14], which promote the further development and application of medical image registration techniques. However, the inherent limitation of CNN structures, namely their constrained receptive field, makes it difficult to model the long-range dependencies latent in registration.
In recent years, Transformers have shown superiority in many visual tasks [15,16,17] due to their long-range relational modeling capabilities. Inspired by this, many researchers have introduced Transformers to cope with this challenge. For example, DTN [18] (Dual Transformer Network) utilizes a self-attention mechanism to achieve accurate registration; ViT-V-Net [19] combines Vision Transformers (ViT) and classical convolutional networks (U-Net) to achieve the interaction of global information by updating token features; and TransMorph [6] integrates Swin Transformers into the framework and constructs information exchange across windows through a shifted-window mechanism, which significantly reduces the overhead of attention computation. Benefiting from the powerful long-range modeling capability of the core self-attention mechanism, these combinations capture more global features and correlations and show excellent performance. However, the fully connected nature of self-attention incurs a significant computational cost, making it ill-suited to large-scale, high-resolution medical image tasks. Recently, Mamba [20], a state-space model, has made impressive achievements in natural language processing by introducing time-varying parameters and designing hardware-aware algorithms. It has subsequently attracted much attention in computer vision due to its fast inference and the linear complexity of its global modeling. Building on this model, the Mamba architecture has been extended to medical image analysis with promising results. However, most studies have simply modified the Mamba model to capture global features, somewhat neglecting the importance of local feature representation for registration tasks. Therefore, developing an effective Mamba-based model that captures both local and global information is valuable for advancing research in medical image registration.
In this paper, we propose a simple but effective model, RegMamba, that combines global and local features to handle medical image registration tasks. Specifically, we achieve an effective combination of convolution and Mamba by utilizing a scaled convolutional Mamba (SCM) and a deformable convolutional residual module (DCRM) to realize high-precision registration. Unlike other methods, we incorporate a localized design that efficiently learns both large and local deformations in images, alleviating the lack of detailed features and spatial modeling capability in existing Mamba-based registration methods. In addition, to ensure that the generated deformation field is smooth and topology-preserving, we introduce a diffeomorphic formulation. The main contributions of this work are as follows:
  • A novel registration network, RegMamba, is proposed. By combining the global feature capturing ability of the Mamba module with the local feature extraction advantage of CNNs, it not only captures subtle local changes in the image but also provides global relationship modeling, thus achieving more efficient and accurate image registration.
  • We insert a scaled convolution module into the Mamba block to adaptively scale local features, compensating for the spatial information that may be lost in Mamba's processing of 3D data.
  • The deformable convolutional residual module (DCRM) adaptively adjusts the position and shape of the convolution kernel according to the characteristics of the input data to extract clearer and more discriminative local variation features.
This article is structured as follows: Section 2 describes related work, Section 3 details our registration process and methodology, Section 4 reports the details and results of our experiments, and Section 5 gives our conclusions.

2. Related Work

The application of medical image registration has been extensively studied over the years [21]. In this section, we review approaches to the challenges of registration and the background relevant to the methods we use.

2.1. Prior Works on Medical Image Registration

Deformable medical image registration aims to maximize the similarity between two images while keeping the deformation smooth and physically reasonable. Traditional methods fall mainly into two categories: parametric and non-parametric. Parametric methods usually preset a set of fixed parameters and complete registration by learning these parameters to fit the data; classical examples include B-Splines [22] and Thin-Plate Splines [23]. Non-parametric methods, in contrast, do not predefine any parameters and instead learn the deformation field directly from the data; they are usually more flexible and better suited to complex data structures. Classical methods include SyN [1], NiftyReg [4], LDDMM [2], and Demons [24,25]. However, all of these traditional methods learn non-rigid deformation relations between images by minimizing an energy function over many iterations, which inevitably leads to heavy computation and long runtimes.
With the development of deep learning techniques and the optimization of algorithms, deep learning-based methods, especially convolutional neural networks (CNNs), are widely used in registration tasks. Their ability to learn a generalized representation of registration in the training stage enables the fast registration of trained unseen image pairs. They are usually categorized into two types: supervised methods and unsupervised methods. The former requires ground truth to be engaged in the training of the neural network model, and the accuracy of the registration is highly dependent on the quality of the ground truth. However, ground truth is difficult and expensive to obtain, and is usually generated by traditional methods, e.g., Quicksilver [12], or integrated synthesis, e.g., RegNet [10]. Because of this limitation, unsupervised methods that do not require ground truth have become the focus of development, e.g., VoxelMorph [5], CycleMorph [8], DIRNet [14], and others. These methods have demonstrated good registration performance.
However, almost all of the above early work based on deep learning used U-Net or U-Net variants, thus the limited receptive field made it difficult for them to capture sufficiently complex deformations. For this reason, some research has worked on developing multiple networks to increase the receptive field and expressive power of the model. For example, Zhao et al. [26] proposed recursive cascade networks that use multiple VoxelMorph networks to progressively distort moving images for better fitting, Dual-PRNet employs a dual-stream encoder–decoder network to compute a multi-scale registration field, and PCNet proposes a dual-encoder U-Net to decompose a single complex deformation field from coarse to fine. Concurrently, some researchers have explored the introduction of Transformers to handle large deformations. In addition to the previously mentioned DTN, ViT-V-Net, and TransMorph, XMorpher [9] designed dual parallel encoder networks to introduce cross-attention to extract inter-image correspondences, Deformer [27] used a Transformer module to predict the displacement field through a multi-scale framework in a coarse-to-fine manner, and DAVoxelMorph [28] added dual attention in VoxelMorph to model semantic associations in spatial and coordinate dimensions.
Although these methods are effective, multiple networks or Transformers can lead to significant computational costs. Considering this issue, Mamba has recently been used as an alternative to Transformer, but its linear approach to processing data may not be fully applicable to medical images with multiple anatomical structures. In this paper, we use a novel approach to efficiently integrate Mamba into a registration network framework to address the computational challenges while capturing complex deformations more effectively.

2.2. Mamba in Vision Tasks

Mamba [20] is an emerging state-space model (SSM) that improves on the S4 model by introducing time-varying parameters and designing hardware-aware algorithms. Due to its unique advantages of fast inference and linear complexity on long sequences, it has recently attracted much attention in the computer vision community and is considered a favorable choice for long-range relational modeling. As research has deepened, Mamba has been extended to several vision tasks, demonstrating strong adaptability and superior performance.
Vim [29] was the earliest model to apply Mamba to the vision domain; it additionally utilizes position embeddings to retain spatial location information and proposes a bidirectional SSM to handle the non-causal sequence relationships in images. Almost simultaneously, VMamba [30] was proposed, which traverses the image in four directions to address the insensitivity of 1D scanning to spatial location in vision tasks. Subsequently, several Mamba-based visual backbones have been intensively studied. PlainMamba [31] adopts a non-hierarchical structure to facilitate the fusion of features at different scales and enhance multi-scale integration. LocalMamba [32] divides the input image into multiple local windows to efficiently capture local dependencies, and adds spatial and channel attention modules before patch merging to reduce information redundancy and enhance features. EfficientVMamba [33], SiMBA [34], and others further extend the application of Mamba.
As Mamba has gained attention, it has also been intensively studied in medical image analysis. The authors of Mamba-UNet [35] introduced VMamba into the U-Net framework and proposed an integration mechanism to achieve accurate medical image segmentation. U-Mamba [36] is a hybrid CNN-SSM model that combines the ability of SSMs to capture global features with the local feature extraction capability of convolutional layers. LightM-UNet [37] combines Mamba and UNet in a lightweight framework to explore the potential of model lightweighting. Others, including SegMamba [38] and VM-UNet [39], have achieved effective performance in segmentation tasks. In addition, MedMamba [40] introduces a hybrid basic module, SS-Conv-SSM, employing a grouped convolution strategy and channel-shuffle operations for medical image classification. In registration, Mamba-based models remain few; MambaMorph [41] introduces a Mamba structure into the TransMorph framework and adds a feature extractor to improve multi-modal registration accuracy.
Despite the overall effectiveness of these methods, the important contribution of local information to accuracy is often overlooked, especially in dense prediction tasks such as registration. In this article, we combine convolution and Mamba to fuse global and local information for dense prediction in deformable registration.

2.3. Diffeomorphic Registration

Diffeomorphic registration is an extremely important technique in medical image registration. Its core goal is to achieve accurate alignment between images by ensuring continuity and reversibility of the deformation process. This technique is crucial in medical image processing because it maintains the integrity and consistency of image topology. Specifically, diffeomorphic registration requires the deformation field to be a smooth, bijective mapping whose inverse is also smooth, so that the image deformation is accurately controlled and topological errors that may occur during deformation, such as tearing or overlapping, are avoided. This property ensures the coherence and accuracy of tissue and anatomical structures in image processing and analysis, enabling physicians and researchers to perform diagnosis and treatment planning more reliably.
Some traditional methods, such as SyN [1] and LDDMM [2], obtain diffeomorphic deformation fields through time integration: the deformation field $\phi^{(t)}$ is obtained by integrating a time-varying velocity field $\upsilon^{(t)}$ over time, $\frac{d\phi^{(t)}}{dt} = \upsilon^{(t)}(\phi^{(t)})$ with $\phi^{(0)} = id$, where $id$ denotes the identity transformation. Some deep learning methods instead fold this process into network training and inference by transforming a stationary velocity field (SVF) into a diffeomorphic deformation field through scaling and squaring [42]. Given a stationary velocity field $\upsilon$, the deformation field is obtained through the exponential map, $\phi^{(1)} = \exp(\upsilon)$. Since directly computing the exponential map is often complicated, scaling and squaring first scales the velocity field to a small time step and then repeatedly composes the result with itself to approximate the map. In our work, we use this technique for diffeomorphic registration, with the number of steps set to $k = 7$ based on previous experience [6,43]. Hence, the initial transformation is $\phi^{(1/2^k)} = \exp(\upsilon / 2^k)$, and the complete deformation field is then approximated by repeated squaring (self-composition): $\phi^{(1)} = (\phi^{(1/2^k)})^{2^k}$, where the power denotes $2^k$-fold composition.
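For concreteness, the following is a minimal PyTorch sketch of this scaling-and-squaring integration. It assumes voxel-unit displacement fields of shape (B, 3, D, H, W) with channels ordered (dx, dy, dz) and uses a grid_sample-based resampler; it illustrates the technique, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def warp(img, disp):
    """Resample `img` (B, C, D, H, W) at voxel locations shifted by `disp` (B, 3, D, H, W)."""
    B, _, D, H, W = disp.shape
    zs = torch.linspace(-1, 1, D, device=disp.device)
    ys = torch.linspace(-1, 1, H, device=disp.device)
    xs = torch.linspace(-1, 1, W, device=disp.device)
    gz, gy, gx = torch.meshgrid(zs, ys, xs, indexing="ij")
    grid = torch.stack((gx, gy, gz), dim=-1).unsqueeze(0).expand(B, -1, -1, -1, -1)
    # Convert voxel-unit displacements to grid_sample's normalized (x, y, z) coordinates.
    scale = torch.tensor([2.0 / (W - 1), 2.0 / (H - 1), 2.0 / (D - 1)], device=disp.device)
    disp_norm = disp.permute(0, 2, 3, 4, 1) * scale  # channels assumed ordered (dx, dy, dz)
    return F.grid_sample(img, grid + disp_norm, align_corners=True)

def scaling_and_squaring(v, k=7):
    """Approximate phi = exp(v): scale v to a small step, then compose the field with itself k times."""
    phi = v / (2 ** k)               # initial small deformation phi^(1/2^k)
    for _ in range(k):
        phi = phi + warp(phi, phi)   # phi <- phi o phi (composition of displacement fields)
    return phi                       # approximates phi^(1)
```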

3. Method

In our approach, we train the model using an unsupervised framework, and hence no ground truth is required. The overall framework is shown in Figure 1b. Given a pair of moving and fixed images $I_m$ and $I_f$ from different subjects, modalities, or scanning times, the corresponding deformation field $\phi$ is estimated directly by the network model. We then use this deformation field to warp the moving image $I_m$ via the STN (Spatial Transformer Network) [44] so that $I_m \circ \phi$ is consistent with the soft tissue and anatomical structure of the fixed image $I_f$, enabling accurate comparison and analysis of the images. In the training stage, the estimation of an accurate deformation field is guided by a defined loss function, which can be written as:

$\phi^* = \underset{\phi}{\arg\min}\; \mathbb{E}_{(I_f, I_m) \in D} \left[ \mathcal{L}_{sim}(I_m \circ \phi, I_f) + \lambda \mathcal{L}_{smooth}(\phi) \right].$

Here, D denotes the training samples, and $\mathcal{L}_{sim}$ and $\mathcal{L}_{smooth}$ represent the two loss terms. $\lambda$ is the regularization weight, a hyperparameter used to balance $\mathcal{L}_{sim}$ and $\mathcal{L}_{smooth}$. We detail these, together with the network architecture and other implementation details, in the subsequent sections.
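This objective maps directly onto a standard unsupervised training loop. The sketch below is a hedged illustration: `model`, `sim_loss`, and `smooth_loss` are placeholders rather than the released code, and `warp` reuses the STN-style resampler from the scaling-and-squaring sketch in Section 2.3.

```python
import torch

def train_step(model, optimizer, moving, fixed, sim_loss, smooth_loss, lam=4.0):
    """One unsupervised optimization step for the objective above."""
    optimizer.zero_grad()
    flow = model(moving, fixed)    # predicted deformation field phi: (B, 3, D, H, W)
    warped = warp(moving, flow)    # I_m o phi, via the STN-style resampler above
    loss = sim_loss(warped, fixed) + lam * smooth_loss(flow)
    loss.backward()
    optimizer.step()
    return loss.item()
```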

3.1. Network Overview

The network architecture of the proposed RegMamba is a simple U-style encoder–decoder framework. As shown in Figure 1b, it contains five stages in total. The encoder is mainly responsible for downscaling the data and gradually extracting multi-level features of the input image. Except for the first stage, which uses only two simple convolutional layers to preserve the original dimensions and information, the remaining stages each consist of a deformable convolutional residual module and a scaled convolutional Mamba. Specifically, the feature information first passes through the deformable convolutional residual module to improve the capability and stability of the local feature representation, and the global information is then captured by the scaled convolutional Mamba. In addition, instead of using a separate convolutional layer with a stride of 2 for downsampling, we let the deformable convolutional residual module perform the downsampling. After each downsampling, the number of channels and the spatial resolution of all layer outputs within a stage remain unchanged, i.e., the same size is maintained within a stage. Thus, given a volume with a resolution of H × W × L, its spatial resolution is reduced by a factor of 16 in total, and the output size of the last stage is H/16 × W/16 × L/16. Meanwhile, the number of channels increases progressively as the resolution is reduced, starting from the initial C: [C, 2C, 4C, 8C, 8C]. The decoder's main task is to recover the spatial resolution of the feature map while processing the extracted features to generate a dense prediction field. Each stage consists of a transposed convolutional layer and a vanilla convolutional layer, with all convolutional kernels of size 3 × 3 × 3. Moreover, skip connections deliver the features from each encoder layer directly to the corresponding decoder layer during upsampling; this ensures the fusion of high-level semantic information with low-level detail to improve the accuracy and detail preservation of the registration results. The final deformation field $\phi$ is generated directly by the convolutional layer in the last stage. Except for this last convolutional layer, every convolutional layer is followed by an activation function to increase the nonlinear expressiveness of the model and improve feature extraction. A schematic sketch of this stage layout is given below.
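The sketch below reproduces the five-stage channel and resolution schedule, with plain strided convolutions standing in for the DCRM downsampling and SCMB stages; the class and layer names are ours, and the released architecture differs in its internal blocks.

```python
import torch
import torch.nn as nn

class TinyRegMambaSkeleton(nn.Module):
    """U-style skeleton with the [C, 2C, 4C, 8C, 8C] schedule described above."""
    def __init__(self, C=8):
        super().__init__()
        chans = [C, 2 * C, 4 * C, 8 * C, 8 * C]
        self.stem = nn.Sequential(                    # stage 1: two plain convolutions
            nn.Conv3d(2, C, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv3d(C, C, 3, padding=1), nn.LeakyReLU(0.2))
        # Stages 2-5: a strided conv stands in for DCRM downsampling + SCMB.
        self.down = nn.ModuleList([
            nn.Sequential(nn.Conv3d(chans[i], chans[i + 1], 3, stride=2, padding=1),
                          nn.LeakyReLU(0.2)) for i in range(4)])
        # Decoder: transposed conv to upsample, then a vanilla conv to fuse the skip.
        self.up = nn.ModuleList([
            nn.ConvTranspose3d(chans[i + 1], chans[i], 2, stride=2) for i in range(4)])
        self.fuse = nn.ModuleList([
            nn.Sequential(nn.Conv3d(2 * chans[i], chans[i], 3, padding=1),
                          nn.LeakyReLU(0.2)) for i in range(4)])
        self.head = nn.Conv3d(C, 3, 3, padding=1)     # flow head: no activation after it

    def forward(self, moving, fixed):
        x = self.stem(torch.cat([moving, fixed], dim=1))
        skips = [x]
        for d in self.down:
            x = d(x)
            skips.append(x)
        for i in reversed(range(4)):
            x = self.up[i](x)                                   # recover spatial resolution
            x = self.fuse[i](torch.cat([x, skips[i]], dim=1))   # skip connection
        return self.head(x)                                     # dense deformation field phi
```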
Benefiting from this design, the model is able to better capture potential local and global information. In the next subsections, we will introduce the scaled convolution Mamba and the deformable convolutional residual module in detail.

3.2. Scaled Convolutional Mamba Block

Transformer-based registration networks have shown significant performance advantages due to their effective global dependency modeling, but they typically face significant computational challenges due to the quadratic complexity of the self-attention mechanism. Recently, Mamba has been recognized as an effective alternative to Transformers; it is mainly inspired by state-space models (especially S4 [45]). These models depend on a system of linear ordinary differential equations that map a 1D input sequence $x(t) \in \mathbb{R}$ to an output $y(t) \in \mathbb{R}$ through a hidden state $h(t) \in \mathbb{R}^N$:

$h'(t) = \mathbf{A}h(t) + \mathbf{B}x(t),$
$y(t) = \mathbf{C}h(t) + \mathbf{D}x(t),$

where N denotes the state size. Specifically, the model accomplishes the two-stage sequence-to-sequence transformation by means of the defined parameters $\mathbf{A} \in \mathbb{R}^{N \times N}$, $\mathbf{B} \in \mathbb{R}^{N}$, $\mathbf{C} \in \mathbb{R}^{N}$, and $\mathbf{D} \in \mathbb{R}^{1}$.
Subsequently, since the input samples are discrete signals, the zero-order hold method is usually used to discretize the state space: a time-scale parameter $\Delta$ is introduced to discretize $\mathbf{A}$ and $\mathbf{B}$. With this formulation, the state-space model enjoys computational parallelism and linear time complexity; however, it often suffers from gradient vanishing and memory forgetting on long sequences. To address this, the creators of S4 employed a High-Order Polynomial Projection Operator (HiPPO) to construct the state matrix $\mathbf{A} \in \mathbb{R}^{N \times N}$ and improve long-range dependency modeling. Mamba extends this approach with a selection mechanism that allows the model to adapt to the characteristics of the input, together with a hardware-aware algorithm for efficient computation. These innovations let the model handle long sequences with significantly improved performance while effectively mitigating the computational complexity problem.
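As a didactic illustration, the sketch below runs the zero-order-hold-discretized recurrence sequentially. Actual Mamba implementations make $\Delta$, $\mathbf{B}$, and $\mathbf{C}$ input-dependent (the selection mechanism) and replace this loop with a hardware-aware parallel scan, so this is a simplification, not Mamba itself.

```python
import torch

def ssm_scan(x, A, B, C, D, delta):
    """Sequential (unoptimized) scan of the discretized SSM: x (L,) -> y (L,)."""
    N = A.shape[0]
    # Zero-order hold discretization (assumes delta * A is invertible):
    #   A_bar = exp(delta A),  B_bar = (delta A)^{-1} (exp(delta A) - I) delta B
    A_bar = torch.linalg.matrix_exp(delta * A)
    B_bar = torch.linalg.solve(delta * A, A_bar - torch.eye(N)) @ (delta * B)
    h = torch.zeros(N)
    y = torch.zeros_like(x)
    for t in range(x.shape[0]):
        h = A_bar @ h + B_bar * x[t]   # h_t = A_bar h_{t-1} + B_bar x_t
        y[t] = C @ h + D * x[t]        # y_t = C h_t + D x_t
    return y
```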
Due to these advantages, some researchers have recently employed Mamba in computer vision, but most focus on designing different scanning orders to further enhance its global modeling capability while neglecting local information capture. However, for dense prediction tasks such as deformable medical image registration, the ability to capture local information is usually crucial. It is also worth noting that Mamba processes data as 1D sequences; for 3D medical images, this may compromise spatial location information and the integrity of anatomical structures. To mitigate such potential damage, we propose the scaled convolutional Mamba block (SCMB), as shown in Figure 1c.
The proposed SCMB contains two branches, each performing a different task, to fully exploit the strengths of the original Mamba module while enhancing local information capture. The first branch is a complete, unmodified Mamba block; we leave its structure unchanged to preserve its efficiency in processing global information and capturing complex dynamics. The other branch focuses on supplementing sensitive local and spatial location information. Specifically, this branch first extracts local information from the input features through a convolutional layer. The extracted features are then adaptively scaled by a learnable parameter, enabling the model to autonomously optimize feature extraction according to the task and dataset characteristics and improving the flexibility and accuracy of the model. Finally, the output feature maps of the two branches are aggregated by element-wise addition. Given the input $F_{in} \in \mathbb{R}^{C \times H \times W \times L}$, the SCMB module can be described as:

$F_{out} = Mamba(F_{in}) \oplus \alpha \cdot SiLU(Conv3d(F_{in})).$

Here, ⊕ denotes element-wise addition, and α denotes the learnable scaling parameter. For the Mamba branch, we first flatten $F_{in}$ into a sequence $F'_{in} \in \mathbb{R}^{N \times C}$ and apply LayerNorm. The feature is then split along the channel dimension into two parts, $F_1 \in \mathbb{R}^{N \times \frac{C}{2}}$ and $F_2 \in \mathbb{R}^{N \times \frac{C}{2}}$. The first half is expanded back to dimension C via a linear transformation and activated with the SiLU function [46]. The second half sequentially passes through a dimension-expanding linear layer, a 1D convolutional layer, the state-space model (SSM), and LayerNorm [47]. The outputs of the two parts are then fused using the Hadamard product. The whole Mamba branch can be formalized as follows:

$\tilde{F}_a = SiLU(Linear(F_1)),$
$\tilde{F}_b = LN(SSM(Conv1d(Linear(F_2)))),$
$Mamba(F_{in}) = Linear(\tilde{F}_a \odot \tilde{F}_b).$

Here, ⊙ denotes the Hadamard product.
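A minimal sketch of the SCMB's two-branch design follows, assuming an off-the-shelf `mamba_block` (e.g., from the mamba_ssm package) is passed in; the layer names and the exact normalization placement are our reading of Figure 1c, not the released code.

```python
import torch
import torch.nn as nn

class SCMB(nn.Module):
    """Two-branch block: an unmodified Mamba branch plus a scaled local-conv branch."""
    def __init__(self, channels, mamba_block):
        super().__init__()
        self.mamba = mamba_block                   # global branch: original Mamba, unchanged
        self.norm = nn.LayerNorm(channels)
        self.conv = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.SiLU()
        self.alpha = nn.Parameter(torch.ones(1))   # learnable scaling of the local branch

    def forward(self, x):                          # x: (B, C, D, H, W)
        B, C, D, H, W = x.shape
        # Global branch: flatten the volume into a (B, N, C) token sequence for Mamba.
        tokens = x.flatten(2).transpose(1, 2)      # (B, D*H*W, C)
        global_feat = self.mamba(self.norm(tokens))
        global_feat = global_feat.transpose(1, 2).reshape(B, C, D, H, W)
        # Local branch: adaptively scaled 3D convolution retains spatial detail.
        local_feat = self.alpha * self.act(self.conv(x))
        return global_feat + local_feat            # element-wise addition (⊕)
```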

3.3. Deformable Convolutional Residual Module

Although the scaled convolutional Mamba can alleviate the problem of missing local and spatial information to some extent, we still believe an additional module focused on local feature extraction is necessary. To this end, we propose the deformable convolutional residual module (DCRM). This module allows the convolution kernel to flexibly adjust its sampling positions by learning spatially adaptive offsets, capturing complex local features.
The DCRM has a simple design, containing only three convolutional layers: a 3 × 3 × 3 deformable convolutional layer, a 3 × 3 × 3 vanilla convolutional layer, and a 1 × 1 × 1 convolutional layer. The vanilla convolutional layer captures general features. However, its sampling points are fixed and uniformly distributed, which suits regularly shaped features but may struggle to adapt to the complex physiological structures and deformations in medical images; we therefore apply a deformable convolution [48] after the vanilla convolution. The deformable convolution adds an extra convolutional layer to learn offsets and then shifts the sampling positions of the standard kernel according to these offsets, yielding an adaptive kernel that flexibly fits the local features of the input image. This flexibility allows the model to better capture deformations and the boundaries between anatomical structures, benefiting local feature characterization. Although this inevitably adds computation compared with vanilla convolution, the cost is acceptable. Finally, a dilated convolution [49] layer helps capture information at different scales and extract more representative deep features while maintaining computational efficiency, and the residual connection ensures the effective propagation of feature information.
The overall structure of the DCRM is shown in Figure 1d and can be mathematically represented as:

$F_{out} = DeConv(Conv(F_{in})) \oplus DiConv(F_{in}),$

where $F_{in} \in \mathbb{R}^{C \times H \times W \times L}$ denotes the input features, DeConv represents the deformable convolutional layer, Conv refers to vanilla convolution, and DiConv is dilated convolution. In this way, the DCRM efficiently enhances the feature representation of the model without adding excessive computational complexity.
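Since an off-the-shelf 3D deformable convolution is not part of torchvision, the following 2D analogue sketches the DCRM composition using torchvision.ops.DeformConv2d; the layer names, stride handling, and the dilated residual branch are our assumptions based on the description above, not the released 3D implementation.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DCRM2D(nn.Module):
    """2D analogue of the DCRM: vanilla conv -> deformable conv, plus a dilated residual."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1)  # vanilla conv
        # Offset predictor: 2 offsets (dx, dy) per sampling point of the 3x3 kernel.
        self.offset = nn.Conv2d(out_ch, 2 * 3 * 3, 3, padding=1)
        self.deform = DeformConv2d(out_ch, out_ch, 3, padding=1)           # deformable conv
        # Residual branch: dilated conv capturing a wider scale (assumed layout).
        self.residual = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=2, dilation=2)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x):
        f = self.act(self.conv(x))
        f = self.deform(f, self.offset(f))  # sampling positions shifted by learned offsets
        return self.act(f + self.residual(x))
```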

3.4. Loss Function

Since we adopt an unsupervised strategy, the loss function contains only two parts: image similarity loss ( L s i m ) and deformation field smoothness loss ( L s m o o t h ). The goal of L s i m is to minimize the difference between the moving image and the fixed image, so that the deformed moving image is as consistent as possible with the fixed image in terms of anatomical structure, thus improving the accuracy and reliability of registration. On the other hand, L s m o o t h aims to ensure that the deformation field estimated by the model is sufficiently smooth and reasonable and avoids discontinuities and drastic changes. This constraint not only helps to generate physically reasonable deformation fields, but also prevents artifacts and error accumulation during the registration process. The total loss function is usually written as:
$\mathcal{L}_{total} = \mathcal{L}_{sim}(I_m \circ \phi, I_f) + \lambda \mathcal{L}_{smooth}(\phi),$

where $I_m \circ \phi$ denotes $I_m$ warped by $\phi$, and $\lambda$ is the regularization parameter.
Image Similarity Loss: Two image similarity losses are used in our experiments: local normalized cross-correlation (LNCC) and mean squared error (MSE). The first is formulated as:

$LNCC(I_f, I_m \circ \phi) = \sum_{n \in \Omega} \frac{\left( \sum_{n_i} (I_f(n_i) - \bar{I}_f(n))([I_m \circ \phi](n_i) - [\overline{I_m \circ \phi}](n)) \right)^2}{\left( \sum_{n_i} (I_f(n_i) - \bar{I}_f(n))^2 \right)\left( \sum_{n_i} ([I_m \circ \phi](n_i) - [\overline{I_m \circ \phi}](n))^2 \right)}.$

The second is formulated as:

$MSE(I_f, I_m \circ \phi) = \frac{1}{|\Omega|} \sum_{n \in \Omega} \left\| I_f(n) - [I_m \circ \phi](n) \right\|^2,$

where $\Omega$ denotes the image volume and n refers to its voxels. $\bar{I}_f(n)$ and $[\overline{I_m \circ \phi}](n)$ indicate the average intensities within a window of size $N^3$ centered on voxel n, over which the $n_i$ iterate. In our study, we set N = 9.
Deformation Field Smoothness Loss: This is also often called the regularization loss. In this article, we use a diffusion regularizer [5] to penalize the spatial gradients of the deformation field:

$\mathcal{L}_{smooth}(\phi) = \sum_{n \in \Omega} \left\| \nabla \phi(n) \right\|^2.$
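Hedged sketches of the MSE term and the diffusion regularizer follow; the LNCC term is omitted for brevity, since it is typically computed with windowed sums (e.g., via conv3d with an all-ones kernel) in VoxelMorph-style implementations.

```python
import torch

def mse_loss(fixed, warped):
    """Mean squared intensity difference over the volume."""
    return torch.mean((fixed - warped) ** 2)

def smooth_loss(flow):
    """Diffusion regularizer: squared forward differences of the flow (B, 3, D, H, W)."""
    dz = flow[:, :, 1:, :, :] - flow[:, :, :-1, :, :]
    dy = flow[:, :, :, 1:, :] - flow[:, :, :, :-1, :]
    dx = flow[:, :, :, :, 1:] - flow[:, :, :, :, :-1]
    return (dz ** 2).mean() + (dy ** 2).mean() + (dx ** 2).mean()
```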

4. Experiments

4.1. Data and Pre-Processing

IXI: The IXI dataset (https://brain-development.org/ixi-dataset/, accessed on 3 March 2023) is a publicly available collection of nearly 600 3D MRI scans of healthy subjects. Following previous work [6], we used a single image as the moving image. A total of 576 images were then selected from the dataset as fixed images and divided into training, validation, and test sets in a ratio of 7:1:2, containing 403, 58, and 115 images, respectively. In addition, all image volumes were cropped to a size of 160 × 192 × 224, and label maps of 30 anatomical structures were used as ground truth for evaluating registration performance.
LPBA40: The LPBA40 dataset (https://www.loni.usc.edu/research/atlases, accessed on 12 May 2023) [50] contains 40 T1-weighted MRI images, each labeled with 56 subcortical regions of interest (ROIs). According to the code provided in the literature [51], we combined these 56 labels into 7 regions: frontal lobe, occipital lobe, parietal lobe, temporal lobe, chiasma nucleus, cingulate lobe, and hippocampus. All images were cropped to a size of 160 × 192 × 160 to normalize the input data. After the data processing was completed, we selected 30 of the images to train the model, generating 30 × 29 registration volume pairs. The remaining 10 images were used for testing, generating 10 × 9 test registration pairs.

4.2. Evaluation Metrics

Dice coefficient: The Dice coefficient [52] is one of the most important indices for evaluating registration performance. It measures the degree of overlap of anatomical or organ structures after registration and takes values between 0 and 1; the higher the value, the better the accuracy and reliability of registration and the greater the overlap of anatomical structures between the fixed and aligned images:

$Dice(S_m, S_f) = \frac{1}{K} \sum_{n=1}^{K} \frac{2 \times |S_f^n \cap S_m^n|}{|S_f^n| + |S_m^n|},$

where K denotes the number of label categories, each with a binary mask, and $S_m$ and $S_f$ represent the anatomical structures of the warped moving image and the fixed image, respectively.
Jacobian determinant: The Jacobian determinant [24] facilitates the detection and evaluation of the smoothness and plausibility of the deformation field; it describes the local rate of change of the deformation. Where the determinant is zero or negative, inversion, contraction, or folding has occurred locally. Specifically, $|J_\phi(p)| \le 0$ indicates anomalous deformation at position p, where the deformation is neither feasible nor credible. We report the percentage of such voxels; a smaller value indicates a more plausible deformation.
Hausdorff distance: The Hausdorff distance measures the distance between the farthest point pairs of the fixed and deformed moving structures, and therefore captures the worst case of the registration results. The smaller the distance, the higher the similarity between the fixed and deformed structures and the better the registration:

$HD = \frac{1}{K} \sum_{n=1}^{K} \max_{p \in S_f^n} \min_{q \in S_m^n} d(p, q),$

where d(p, q) denotes the distance between point p in the deformed segmentation structure and the nearest point q in the ground truth.
Mult-Adds (G): We use multiply–add operations (in giga-operations) to measure the complexity of a neural network; smaller values indicate a less complex model and thus lower computational cost and resource consumption.
Time: The time measures one inference iteration, from inputting an image pair to the model to outputting the deformation field and the deformed image.
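For illustration, the sketch below computes two of these metrics, mean Dice over K labels and the fraction of voxels with a non-positive Jacobian determinant, from a label-map pair and a displacement field; the forward-difference Jacobian and the (dx, dy, dz) channel order are our assumptions (the determinant is unaffected by a consistent reordering).

```python
import torch

def dice(seg_fixed, seg_moved, labels):
    """Mean Dice over the given label ids for two integer label maps."""
    scores = []
    for k in labels:
        f, m = (seg_fixed == k), (seg_moved == k)
        inter = (f & m).sum().float()
        scores.append(2 * inter / (f.sum() + m.sum()).clamp(min=1))
    return torch.stack(scores).mean()

def folding_fraction(flow):
    """flow: (3, D, H, W) displacement; fraction of voxels with |J_phi| <= 0."""
    # phi(x) = x + u(x); J_phi = I + grad(u), estimated with forward differences.
    du_dz = flow[:, 1:, :-1, :-1] - flow[:, :-1, :-1, :-1]
    du_dy = flow[:, :-1, 1:, :-1] - flow[:, :-1, :-1, :-1]
    du_dx = flow[:, :-1, :-1, 1:] - flow[:, :-1, :-1, :-1]
    J = torch.stack((du_dx, du_dy, du_dz), dim=0)   # (3 deriv dirs, 3 components, ...)
    J = J.permute(2, 3, 4, 1, 0)                    # (..., 3, 3) per-voxel grad(u)
    J = J + torch.eye(3)                            # add identity: J_phi = I + grad(u)
    det = torch.linalg.det(J)
    return (det <= 0).float().mean().item()
```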

4.3. Baseline Methods

In this section, we present some related algorithms for comparison, including two categories: traditional algorithms and advanced learning-based registration algorithms. To ensure the fairness of the comparison, we use the same similarity and smoothness loss functions on the same datasets, unless otherwise stated.
SyN: Symmetric Normalization [1] uses a symmetric deformation field in the registration to ensure the consistency and accuracy of the deformation process. We use the mean square deviation (MSQ) as the objective function for the iterative optimization and set the number of iterations to (180, 80, 40) and the Gaussian smoothing to 3.
NiftyReg: NiftyReg [4] is a powerful and efficient toolkit for medical image registration. In our experiments, we use the sum of squared differences (SSD) as the objective function for iterative optimization, the smoothing loss weight is set to 0.0002, the scale is set to 3, and each scale is iterated 300 times.
VoxelMorph: We use two variants of VoxelMorph [5], VoxelMorph-1 and VoxelMorph-2, where the second one has twice as many convolutional filters as the first variant, and we use the NCC as the loss function, with the smoothing loss coefficient set to 1. The learning rate is 0.0004.
TransMorph: We use the officially provided open-source code, which includes three variants [6]: TransMorph, TransMorph-bspl, and TransMorph-diff. The encoder consists of four stages of Swin Transformer blocks, with 2, 2, 4, and 2 blocks, respectively. In addition, the window size is (5, 6, 7) and the embedding dimension C is 96.

4.4. Experimental Settings

The proposed RegMamba is implemented in PyTorch and run on a computer equipped with an NVIDIA TITAN RTX GPU, an Intel i5-13490F CPU, and an RTX 4080 GPU. To increase data diversity, we apply data augmentation such as random flipping to all training data. All our variants, i.e., RegMamba and RegMamba-diff, share the same parameter settings, including the smoothness loss coefficient on a given dataset. Specifically, the number of scaled convolutional Mamba blocks is set to (1, 1, 1, 2), the number of deformable convolutional residual blocks is set to 2, and the number of starting channels is set to 8. The smoothness loss hyperparameter is set to 4 when normalized cross-correlation (NCC) is used as the image similarity loss, and to 0.04 when the mean squared error (MSE) is used. In addition, the learning rate for all training is set to 0.0004, and the Adam optimization algorithm is used for 500 iterations with a batch size of 1. Our code is available at https://github.com/Hexu00/Regmamba, accessed on 1 August 2024.

4.5. Results

We performed quantitative and qualitative comparisons on each dataset separately. Table 1 shows the quantitative results on the IXI dataset, where our method achieves the highest Dice coefficient of 0.8665, about a 1.7% improvement over TransMorph and a significant improvement over other conventional or Transformer-based methods. In addition, our method significantly outperforms TransMorph, PVT, and nnFormer in terms of parameter count and complexity. Specifically, we use only 6.3% of the parameters and 0.13% of the multiply–add operations (Mult-Adds) of TransMorph, showing clear advantages in computational cost and resource usage. For diffeomorphic registration, our RegMamba-diff yields the second-best result of 0.7627, higher than VoxelMorph-diff (0.577) and TransMorph-diff (0.594). Due to its diffeomorphic property, this registration achieves high mathematical plausibility with almost no folding; this ensures the rationality and smoothness of the deformation field and improves the reliability and interpretability of registration. Figure 2 shows a comparison of the quantitative results for anatomical structures across different scans in the IXI dataset. We selected eight classical methods and compared performance on several anatomical structures; the results show that our method improves on most of the structures.
The qualitative results are shown in Figure 3. In terms of visual results, the first row shows that our method has the highest similarity in detail to the fixed image, with clearer detail structures. The second row demonstrates the smoothness of the deformation field: Transformer-based models such as TransMorph show drastic variations in the deformation field, while the field estimated by our model is much smoother, showing better continuity and reasonableness. The last row shows the absolute difference between the deformed moving image and the fixed image. These results show that our method not only performs well on quantitative metrics, but also demonstrates excellent visual quality and geometric continuity.
For the LPBA40 dataset, we used the same model as for the IXI dataset, so the differences in time and memory consumption remain similar. The quantitative results are shown in Table 2, where we compare against eight different methods. While most deep learning-based methods, such as VoxelMorph and ViT-V-Net, show no significant advantage over the traditional method SyN, our method achieves the best results, with a Dice coefficient of 0.839, an improvement of about 2.3% over SyN and 1.7% over TransMorph. The standard deviation of the Dice score is 0.014, second only to SyN. In addition, our method also achieves the best performance on the HD95 metric, indicating that it excels in boundary delineation and geometric consistency while maintaining accuracy. These results further validate the effectiveness and robustness of our model in handling different types of medical image registration tasks. In terms of voxel folding, we have only 0.055% ($|J| \le 0$), indicating that the deformation field generated by the model produces drastic changes in only a very small portion of the region, showing good stability and robustness. The quantitative results for the different structures are shown in Figure 4: the boxplots show the distribution of Dice scores for the seven merged anatomical structures, and our comparison with six advanced methods shows that RegMamba improves significantly on each.
The qualitative results for the LPBA40 dataset can be viewed in Figure 5, where we outline the anatomical structures in the result plots. Intuitively, the results of all methods may appear imperfect in some details. However, in the deformation mesh, our method shows good smoothness and no obvious folding. In the absolute difference maps, both our results and SyN's perform well, further validating the effectiveness of our method.

Ablation Study

The results of the first ablation experiment can be viewed in Table 3. We first performed detailed ablation experiments on the publicly available 2D dataset from the OASIS database (https://sites.wustl.edu/oasisbrains/, accessed on 20 March 2024), using the Dice score as the evaluation metric and MSE as the image similarity loss. The experiments are divided into two groups: one evaluates the impact of each component on performance, and the other examines the impact of model size, aiming to comprehensively test the contribution of each component of the proposed model to registration performance and thus provide a deeper understanding of the model's behavior and strengths. In Group A, we test a pure deformable residual block, a pure Mamba block, and a pure scaled convolutional Mamba; the experiments show that each contributes to performance, but the best performance is achieved by their combination. In Group B, we examine the effect of different model sizes on performance by varying the number of initial channels, and find that our model already achieves near-optimal performance at smaller sizes, showing its efficiency and superior parameter utilization. These results validate the effectiveness and innovativeness of our model design while demonstrating the importance of incorporating multiple scales in medical image registration tasks.
The results of the second ablation experiment, performed on the IXI dataset, can be viewed in Table 4. We divided the experiments into four groups, separately analyzing the contribution of each block to registration performance. Among the individual components, the scaled convolutional Mamba block performs best, achieving a Dice score of 76.5; the overall best registration performance is achieved when it is combined with the deformable convolutional residual block.

5. Conclusions

Deformable medical image registration is a challenging task due to the large number of parameters involved and unstable acquisition conditions. In this paper, we propose the RegMamba network for capturing complex correspondences between medical images, aiming to avoid the expensive computation of Transformer-based models while obtaining better registration results. To this end, we propose two modules: a scaled convolutional Mamba module that supplements the local and spatial information potentially missing in the pure Mamba block, and a deformable convolutional residual module that extracts the local correspondences crucial to the registration process. Benefiting from this design, we show good performance on multiple datasets. This is important not only for theoretical studies but also for potential applications in clinical practice. For example, accurate image registration can be used to evaluate disease progression and surgical recovery by comparing images at different time points, in order to select more appropriate treatment options. In targeted cancer therapy, precise registration can align the target area with the actual diseased area, improving the therapeutic effect and minimizing damage to healthy tissue. Moreover, this increased efficiency not only reduces the time required for disease analysis, but also facilitates emerging technologies such as real-time guidance and autonomous surgery.
However, our work still has some limitations. First, there is the issue of computational resources: the deformable convolutional residual module inevitably increases the computational burden of the model, and we hope to further explore the potential of combining Mamba with CNNs in subsequent work. Second, due to limited GPU and memory resources, we did not search intensively for optimal hyperparameter values in training, but instead determined them from the original papers' suggested values or empirically, which may pose challenges for practical applications. Finally, our model was only tested on a single modality, and we expect to extend it to multimodal scenarios in future work to evaluate its applicability more fully.
In summary, our work is a promising step, and we look forward to more effective registration networks that combine Mamba and CNNs in different ways in the future.

Author Contributions

Conceptualization, X.H.; Methodology, X.H.; Software, X.H.; Validation, X.H.; Formal analysis, X.H.; Writing—original draft, X.H.; Writing—review & editing, J.C. and Y.C.; Visualization, X.H.; Supervision, Y.C.; Project administration, Y.C.; Funding acquisition, Y.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Science and Technology Research Project of the Education Department of Hubei Province under Grant No. Q20221501 and the Graduate Innovative Fund of Wuhan Institute of Technology under Grant No. CX2023289.

Informed Consent Statement

All relevant ethical guidelines and norms were strictly adhered to in this study. The data used were obtained from publicly available databases, and informed consent was obtained from all participants at the time of data collection. These datasets were appropriately anonymized prior to public release to protect participants’ privacy and confidential information. We are committed to continuing to maintain data security and participant privacy.

Data Availability Statement

The authors confirm that the data supporting the findings of this study are available within the article.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Appendix A

Figure A1. Subplots of Figure 2, containing boxplot distributions for 17 different substructures.
Figure A2. Subplots of Figure 4, containing boxplot distributions for seven different substructures.

References

  1. Avants, B.; Epstein, C.; Grossman, M.; Gee, J. Symmetric diffeomorphic image registration with cross-correlation: Evaluating automated labeling of elderly and neurodegenerative brain. Med. Image Anal. 2008, 12, 26–41.
  2. Beg, M.F.; Miller, M.I.; Trouvé, A.; Younes, L. Computing Large Deformation Metric Mappings via Geodesic Flows of Diffeomorphisms. Int. J. Comput. Vis. 2005, 61, 139–157.
  3. Heinrich, M.P.; Maier, O.; Handels, H. Multi-modal Multi-Atlas Segmentation using Discrete Optimisation and Self-Similarities. In Proceedings of the VISCERAL Challenge@ISBI, Brooklyn, NY, USA, 16–19 April 2015.
  4. Modat, M.; Ridgway, G.R.; Taylor, Z.A.; Lehmann, M.; Barnes, J.; Hawkes, D.J.; Fox, N.C.; Ourselin, S. Fast free-form deformation using graphics processing units. Comput. Methods Programs Biomed. 2010, 98, 278–284.
  5. Balakrishnan, G.; Zhao, A.; Sabuncu, M.R.; Guttag, J.; Dalca, A.V. VoxelMorph: A Learning Framework for Deformable Medical Image Registration. IEEE Trans. Med. Imaging 2019, 38, 1788–1800.
  6. Chen, J.; Frey, E.C.; He, Y.; Segars, W.P.; Li, Y.; Du, Y. TransMorph: Transformer for unsupervised medical image registration. Med. Image Anal. 2022, 82, 102615.
  7. Jia, X.; Bartlett, J.; Zhang, T.; Lu, W.; Qiu, Z.; Duan, J. U-Net vs Transformer: Is U-Net Outdated in Medical Image Registration? In Proceedings of the Machine Learning in Medical Imaging, Singapore, 18 September 2022; Lian, C., Cao, X., Rekik, I., Xu, X., Cui, Z., Eds.; Springer Nature: Cham, Switzerland, 2022; pp. 151–160.
  8. Kim, B.; Kim, D.H.; Park, S.H.; Kim, J.; Lee, J.G.; Ye, J.C. CycleMorph: Cycle consistent unsupervised deformable image registration. Med. Image Anal. 2021, 71, 102036.
  9. Shi, J.; He, Y.; Kong, Y.; Coatrieux, J.L.; Shu, H.; Yang, G.; Li, S. XMorpher: Full Transformer for Deformable Medical Image Registration via Cross Attention. In Proceedings of the Medical Image Computing and Computer Assisted Intervention—MICCAI 2022, Singapore, 18–22 September 2022; Wang, L., Dou, Q., Fletcher, P.T., Speidel, S., Li, S., Eds.; Springer Nature: Cham, Switzerland, 2022; pp. 217–226.
  10. Sokooti, H.; Vos, B.D.; Berendsen, F.; Lelieveldt, B.P.F.; Išgum, I.; Staring, M. Nonrigid Image Registration Using Multi-Scale 3D Convolutional Neural Networks; Springer: Cham, Switzerland, 2017; pp. 232–239.
  11. Xie, Y.; Zhang, J.; Shen, C.; Xia, Y. CoTr: Efficiently Bridging CNN and Transformer for 3D Medical Image Segmentation. In Proceedings of the Medical Image Computing and Computer Assisted Intervention—MICCAI, Strasbourg, France, 27 September–2 October 2021; de Bruijne, M., Cattin, P.C., Cotin, S., Padoy, N., Speidel, S., Zheng, Y., Essert, C., Eds.; Springer Nature: Cham, Switzerland, 2021; pp. 171–180.
  12. Yang, X.; Kwitt, R.; Styner, M.; Niethammer, M. Quicksilver: Fast predictive image registration—A deep learning approach. NeuroImage 2017, 158, 378–396.
  13. Zhao, S.; Lau, T.; Luo, J.; Chang, E.I.C.; Xu, Y. Unsupervised 3D End-to-End Medical Image Registration with Volume Tweening Network. IEEE J. Biomed. Health Inform. 2020, 24, 1394–1404.
  14. de Vos, B.D.; Berendsen, F.F.; Viergever, M.A.; Staring, M.; Išgum, I. End-to-End Unsupervised Deformable Image Registration with a Convolutional Neural Network. In Proceedings of the Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, Quebec City, QC, Canada, 14 September 2017; Cardoso, M.J., Arbel, T., Carneiro, G., Syeda-Mahmood, T., Tavares, J.M.R., Moradi, M., Bradley, A., Greenspan, H., Papa, J.P., Madabhushi, A., et al., Eds.; Springer Nature: Cham, Switzerland, 2017; pp. 204–212.
  15. Li, W.; Zhou, G.; Lin, S.; Tang, Y. PerNet: Progressive and Efficient All-in-One Image-Restoration Lightweight Network. Electronics 2024, 13, 2817.
  16. Jiao, C.; Yang, T.; Yan, Y.; Yang, A. RFTNet: Region–Attention Fusion Network Combined with Dual-Branch Vision Transformer for Multimodal Brain Tumor Image Segmentation. Electronics 2024, 13, 77.
  17. Baek, J.H.; Lee, H.K.; Choo, H.G.; Jung, S.h.; Koh, Y.J. Center-Guided Transformer for Panoptic Segmentation. Electronics 2023, 12, 4801.
  18. Zhang, Y.; Pei, Y.; Zha, H. Learning Dual Transformer Network for Diffeomorphic Registration. In Proceedings of the Medical Image Computing and Computer Assisted Intervention—MICCAI 2021—24th International Conference, Strasbourg, France, 27 September–1 October 2021; de Bruijne, M., Cattin, P.C., Cotin, S., Padoy, N., Speidel, S., Zheng, Y., Essert, C., Eds.; Springer: Berlin/Heidelberg, Germany, 2021; Volume 12904, Lecture Notes in Computer Science; pp. 129–138.
  19. Chen, J.; He, Y.; Frey, E.C.; Li, Y.; Du, Y. ViT-V-Net: Vision Transformer for Unsupervised Volumetric Medical Image Registration. arXiv 2021, arXiv:2104.06468.
  20. Gu, A.; Dao, T. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv 2024, arXiv:2312.00752.
  21. Galić, I.; Habijan, M.; Leventić, H.; Romić, K. Machine Learning Empowering Personalized Medicine: A Comprehensive Review of Medical Image Analysis Methods. Electronics 2023, 12, 4411.
  22. Rueckert, D.; Sonoda, L.; Hayes, C.; Hill, D.; Leach, M.; Hawkes, D. Nonrigid registration using free-form deformations: Application to breast MR images. IEEE Trans. Med. Imaging 1999, 18, 712–721.
  23. Johnson, H.J.; Christensen, G.E. Landmark and Intensity-Based, Consistent Thin-Plate Spline Image Registration. In Proceedings of the Information Processing in Medical Imaging, Davis, CA, USA, 18–22 June 2001; Insana, M.F., Leahy, R.M., Eds.; Springer: Berlin/Heidelberg, Germany, 2001; pp. 329–343.
  24. Ashburner, J. A fast diffeomorphic image registration algorithm. NeuroImage 2007, 38, 95–113.
  25. Vercauteren, T.; Pennec, X.; Perchant, A.; Ayache, N. Diffeomorphic demons: Efficient non-parametric image registration. NeuroImage 2009, 45, S61–S72.
  26. Zhao, S.; Dong, Y.; Chang, E.; Xu, Y. Recursive Cascaded Networks for Unsupervised Medical Image Registration. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019.
  27. Chen, J.; Lu, D.; Zhang, Y.; Wei, D.; Ning, M.; Shi, X.; Xu, Z.; Zheng, Y. Deformer: Towards Displacement Field Learning for Unsupervised Medical Image Registration. In Proceedings of the Medical Image Computing and Computer Assisted Intervention—MICCAI, Singapore, 18–22 September 2022; Wang, L., Dou, Q., Fletcher, P.T., Speidel, S., Li, S., Eds.; Springer Nature: Cham, Switzerland, 2022; pp. 141–151.
  28. Li, Y.X.; Tang, H.; Wang, W.; Zhang, X.F.; Qu, H. Dual attention network for unsupervised medical image registration based on VoxelMorph. Sci. Rep. 2022, 12, 16250.
  29. Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model. arXiv 2024, arXiv:2401.09417.
  30. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2023, arXiv:1706.03762.
  31. Yang, C.; Chen, Z.; Espinosa, M.; Ericsson, L.; Wang, Z.; Liu, J.; Crowley, E.J. PlainMamba: Improving Non-Hierarchical Mamba in Visual Recognition. arXiv 2024, arXiv:2403.17695.
  32. Huang, T.; Pei, X.; You, S.; Wang, F.; Qian, C.; Xu, C. LocalMamba: Visual State Space Model with Windowed Selective Scan. arXiv 2024, arXiv:2403.09338.
  33. Pei, X.; Huang, T.; Xu, C. EfficientVMamba: Atrous Selective Scan for Light Weight Visual Mamba. arXiv 2024, arXiv:2403.09977.
  34. Patro, B.N.; Agneeswaran, V.S. SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time series. arXiv 2024, arXiv:2403.15360.
  35. Wang, Z.; Zheng, J.Q.; Zhang, Y.; Cui, G.; Li, L. Mamba-UNet: UNet-Like Pure Visual Mamba for Medical Image Segmentation. arXiv 2024, arXiv:2402.05079.
  36. Ma, J.; Li, F.; Wang, B. U-Mamba: Enhancing Long-range Dependency for Biomedical Image Segmentation. arXiv 2024, arXiv:2401.04722.
  37. Liao, W.; Zhu, Y.; Wang, X.; Pan, C.; Wang, Y.; Ma, L. LightM-UNet: Mamba Assists in Lightweight UNet for Medical Image Segmentation. arXiv 2024, arXiv:2403.05246.
  38. Xing, Z.; Ye, T.; Yang, Y.; Liu, G.; Zhu, L. SegMamba: Long-range Sequential Modeling Mamba For 3D Medical Image Segmentation. arXiv 2024, arXiv:2401.13560.
  39. Ruan, J.; Xiang, S. VM-UNet: Vision Mamba UNet for Medical Image Segmentation. arXiv 2024, arXiv:2402.02491.
  40. Yue, Y.; Li, Z. MedMamba: Vision Mamba for Medical Image Classification. arXiv 2024, arXiv:2403.03849.
  41. Guo, T.; Wang, Y.; Shu, S.; Chen, D.; Tang, Z.; Meng, C.; Bai, X. MambaMorph: A Mamba-based Framework for Medical MR-CT Deformable Registration. arXiv 2024, arXiv:2401.13934.
  42. Arsigny, V.; Commowick, O.; Pennec, X.; Ayache, N. A Log-Euclidean Framework for Statistics on Diffeomorphisms. In Medical Image Computing and Computer-Assisted Intervention: MICCAI, Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Copenhagen, Denmark, 1–8 October 2006; Springer: Berlin/Heidelberg, Germany, 2006; pp. 924–931.
  43. Dalca, A.V.; Balakrishnan, G.; Guttag, J.; Sabuncu, M.R. Unsupervised Learning for Fast Probabilistic Diffeomorphic Registration. In Lecture Notes in Computer Science; Springer International Publishing: Berlin/Heidelberg, Germany, 2018; pp. 729–738.
  44. Jaderberg, M.; Simonyan, K.; Zisserman, A.; Kavukcuoglu, K. Spatial transformer networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems, Cambridge, MA, USA, 7–12 December 2015; Volume 2, pp. 2017–2025. [Google Scholar]
  45. Gu, A.; Goel, K.; Ré, C. Efficiently Modeling Long Sequences with Structured State Spaces. arXiv 2022, arXiv:2111.00396. [Google Scholar]
  46. Ramachandran, P.; Zoph, B.; Le, Q.V. Searching for Activation Functions. arXiv 2017, arXiv:1710.05941. [Google Scholar]
  47. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer Normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar]
  48. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable Convolutional Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 764–773. [Google Scholar] [CrossRef]
  49. Yu, F.; Koltun, V. Multi-Scale Context Aggregation by Dilated Convolutions. arXiv 2016, arXiv:1511.07122. [Google Scholar]
  50. Shattuck, D.W.; Mirza, M.; Adisetiyo, V.; Hojatkashani, C.; Salamon, G.; Narr, K.L.; Poldrack, R.A.; Bilder, R.M.; Toga, A.W. Construction of a 3D probabilistic atlas of human cortical structures. NeuroImage 2008, 39, 1064–1080. [Google Scholar] [CrossRef] [PubMed]
  51. Kuang, D.; Schmah, T. FAIM—A ConvNet Method for Unsupervised 3D Medical Image Registration. In Proceedings of the Machine Learning in Medical Imaging, Shenzhen, China, 13 October 2019; Suk, H.I., Liu, M., Yan, P., Lian, C., Eds.; Springer Nature: Cham, Switzerland, 2019; pp. 646–654. [Google Scholar]
  52. Dice, L.R. Measures of the Amount of Ecologic Association Between Species. Ecology 1945, 26, 297–302. [Google Scholar] [CrossRef]
  53. Qiu, H.; Qin, C.; Schuh, A.; Hammernik, K.; Rueckert, D. Learning Diffeomorphic and Modality-invariant Registration using B-splines. In Proceedings of the Medical Imaging with Deep Learning, Virtual Event, 19–21 July 2021. [Google Scholar]
  54. Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. PVT v2: Improved baselines with Pyramid Vision Transformer. Comput. Vis. Media 2022, 8, 415–424. [Google Scholar] [CrossRef]
  55. Zhou, H.; Guo, J.; Zhang, Y.; Yu, L.; Wang, L.; Yu, Y. nnFormer: Interleaved Transformer for Volumetric Segmentation. arXiv 2021, arXiv:2109.03201. [Google Scholar]
  56. Chen, Z.; Zheng, Y.; Gee, J.C. TransMatch: A Transformer-Based Multilevel Dual-Stream Feature Matching Network for Unsupervised Deformable Image Registration. IEEE Trans. Med. Imaging 2024, 43, 15–27. [Google Scholar] [CrossRef]
Figure 1. An overall flowchart of our proposed method. (a) The overall framework of unsupervised medical image registration, where the inputs are the concatenated moving and fixed image volumes; (b) a detailed diagram of the U-Net-style network structure, showing the locations of its main components, where C denotes the initial number of channels, and H, W, and L denote the height, width, and length of the input image, respectively; (c) the deformable convolutional residual module; and (d) the scaled convolutional Mamba module.
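To make Figure 1a concrete, the following is a minimal PyTorch sketch of the warping step common to unsupervised registration frameworks of this type: a backbone (here a hypothetical `net` standing in for RegMamba) maps the concatenated moving/fixed pair to a dense displacement field, and a grid_sample-based spatial transformer [44] warps the moving volume. All names and implementation details below are our illustrative assumptions, not the authors' released code.

```python
# Hedged sketch of the Figure 1a pipeline; SpatialTransformer and net are
# illustrative names, not taken from the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTransformer(nn.Module):
    """Warp a moving volume with a dense displacement field (flow)."""
    def __init__(self, size):
        super().__init__()
        axes = [torch.arange(s, dtype=torch.float32) for s in size]
        grid = torch.stack(torch.meshgrid(axes, indexing="ij"))  # (3, H, W, L)
        self.register_buffer("grid", grid.unsqueeze(0))          # (1, 3, H, W, L)

    def forward(self, moving, flow):
        coords = self.grid + flow                # displaced voxel coordinates
        for i, s in enumerate(flow.shape[2:]):   # normalize each axis to [-1, 1]
            coords[:, i] = 2.0 * coords[:, i] / (s - 1) - 1.0
        # grid_sample expects (N, H, W, L, 3) with the axis order reversed
        coords = coords.permute(0, 2, 3, 4, 1)[..., [2, 1, 0]]
        return F.grid_sample(moving, coords, align_corners=True)

# Usage with 1-channel volumes moving/fixed of shape (1, 1, H, W, L):
#   flow = net(torch.cat([moving, fixed], dim=1))   # net outputs (1, 3, H, W, L)
#   warped = SpatialTransformer((H, W, L))(moving, flow)
```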
Figure 2. Boxplots comparing the Dice scores of RegMamba and existing advanced registration methods across different anatomical structures on the IXI dataset. Subfigures are shown in Appendix A, Figure A1.
Figure 3. A visual comparison of the different methods on the IXI dataset. In the first row, the first image is the fixed image and the rest are the deformed moving images. The first image in the second row is the original moving image, and the subsequent images show the deformations as warped grids. The third row shows the deformation fields, with the x, y, and z spatial components mapped to RGB color channels. The last row shows the difference between each deformed moving image and the fixed image.
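The deformation-field renderings in the third rows of Figures 3 and 5 map displacement components to color. Below is a minimal NumPy sketch of one common scheme, per-component min–max normalization of a 2D slice; the exact normalization used for these figures is our assumption.

```python
import numpy as np

def flow_slice_to_rgb(flow_slice):
    """Map a (3, H, W) displacement slice to an (H, W, 3) RGB image in [0, 1].

    Each component (x, y, z) is min-max normalized independently; this
    normalization choice is an illustrative assumption.
    """
    rgb = np.empty_like(flow_slice, dtype=np.float32)
    for c in range(3):
        comp = flow_slice[c]
        lo, hi = comp.min(), comp.max()
        rgb[c] = (comp - lo) / (hi - lo + 1e-8)
    return np.transpose(rgb, (1, 2, 0))  # channel-last for matplotlib's imshow
```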
Figure 4. Boxplots comparing the Dice scores of our method and other existing advanced registration methods for seven merged anatomical structures on the LPBA40 dataset. Subfigures are shown in Appendix A, Figure A2.
Figure 5. A visual comparison of different methods on the LPBA40 dataset. In the first row, the first image is the fixed image and the rest are the deformed moving images; the overlaid lines show the contours of different anatomical structures. The second row shows the original moving image and the deformation fields rendered as warped grids. The third row shows the deformation fields mapped to RGB color channels. The last row shows the absolute difference between each deformed moving image and the fixed image.
Table 1. Comparison of RegMamba with other registration methods on the IXI dataset; bold numbers indicate the best results, and the second-best results are in italics.
| Model | Dice | % of det(J) ≤ 0 | Time (s) | Parameters | Mult-Adds (G) |
|---|---|---|---|---|---|
| Affine | 0.386 ± 0.195 | - | - | - | - |
| SyN [1] | 0.639 ± 0.151 | <0.0001 | - | - | - |
| NiftyReg [4] | 0.640 ± 0.166 | <0.0001 | - | - | - |
| LDDMM [2] | 0.675 ± 0.135 | <0.0001 | - | - | - |
| deedsBCV [3] | 0.733 ± 0.126 | 0.147 ± 0.050 | - | - | - |
| VoxelMorph-1 [5] | 0.723 ± 0.130 | 1.590 ± 0.339 | 0.114 | 274,387 | 304.05 |
| VoxelMorph-2 [5] | 0.726 ± 0.123 | 1.522 ± 0.336 | 0.131 | 301,411 | 398.81 |
| VoxelMorph-diff [43] | 0.577 ± 0.165 | <0.0001 | 0.042 | 307,878 | 89.67 |
| CycleMorph [8] | 0.730 ± 0.124 | 1.719 ± 0.382 | 0.143 | 361,299 | 49.42 |
| MIDIR [53] | 0.736 ± 0.129 | <0.0001 | 0.061 | 266,387 | 47.05 |
| ViT-V-Net [19] | 0.728 ± 0.124 | 1.609 ± 0.319 | 0.139 | 9,815,431 | 10.60 |
| CoTr [11] | 0.721 ± 0.128 | 1.858 ± 0.314 | 0.478 | 38,684,995 | 1461.61 |
| PVT [54] | 0.729 ± 0.135 | 1.292 ± 0.342 | 0.149 | 58,749,007 | 193.61 |
| nnFormer [55] | 0.740 ± 0.134 | 1.595 ± 0.358 | 0.916 | 34,415,851 | 686.77 |
| TransMorph [6] | 0.746 ± 0.128 | 1.579 ± 0.328 | 0.249 | 46,771,251 | 657.64 |
| TransMorph-Bayes [6] | 0.753 ± 0.123 | 1.560 ± 0.333 | 5.942 | 21,205,491 | 657.69 |
| TransMorph-diff [6] | 0.594 ± 0.163 | <0.0001 | 0.252 | 46,557,414 | 252.61 |
| RegMamba | **0.7665 ± 0.124** | 0.312 ± 0.126 | 0.185 | 2,983,742 | 329.67 |
| RegMamba-diff | *0.7627 ± 0.123* | <0.0001 | 0.208 | 2,983,742 | 329.67 |
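The "% of det(J) ≤ 0" column in Tables 1 and 2 reports the fraction of voxels at which the Jacobian determinant of the deformation φ(x) = x + u(x) is nonpositive, i.e., where the mapping folds. A hedged NumPy sketch of this metric follows; it uses central differences via np.gradient, which is our choice rather than a scheme specified by the paper.

```python
import numpy as np

def nonpositive_jacobian_percentage(flow):
    """flow: (3, H, W, L) displacement field u(x) in voxel units.

    The Jacobian of phi(x) = x + u(x) is J = I + grad(u); grad(u) is
    estimated with central differences, and the function returns the
    percentage of voxels with det(J) <= 0.
    """
    # grads[i, j] = d u_i / d x_j, each component of shape (H, W, L)
    grads = np.stack([np.stack(np.gradient(flow[i]), axis=0) for i in range(3)])
    J = grads + np.eye(3).reshape(3, 3, 1, 1, 1)
    det = (J[0, 0] * (J[1, 1] * J[2, 2] - J[1, 2] * J[2, 1])
           - J[0, 1] * (J[1, 0] * J[2, 2] - J[1, 2] * J[2, 0])
           + J[0, 2] * (J[1, 0] * J[2, 1] - J[1, 1] * J[2, 0]))
    return 100.0 * np.mean(det <= 0)
```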
Table 2. The comparative results of the brain MRI registration task on the LPBA40 dataset. The best results are indicated in bold. We report the mean Dice score (Dice), the 95th percentile symmetric Hausdorff distance (HD95, in millimeters), and the percentage of voxels with a nonpositive Jacobian determinant for the different methods.
| Model | Dice ↑ | HD95 ↓ | % of det(J) ≤ 0 |
|---|---|---|---|
| Initial | 0.686 ± 0.045 | 6.419 ± 1.014 | - |
| SyN [1] | 0.821 ± 0.015 | 4.655 ± 0.592 | <0.0001 |
| NiftyReg [4] | 0.813 ± 0.012 | 4.589 ± 0.652 | <0.0001 |
| VoxelMorph [5] | 0.808 ± 0.017 | 4.804 ± 0.604 | 0.632 ± 0.222 |
| CycleMorph [8] | 0.822 ± 0.016 | 4.644 ± 0.623 | 0.030 ± 0.028 |
| ViT-V-Net [19] | 0.817 ± 0.018 | 4.851 ± 0.633 | 0.588 ± 0.199 |
| TransMorph [6] | 0.825 ± 0.016 | 4.688 ± 0.623 | 0.479 ± 0.143 |
| TransMorph-diff [6] | 0.798 ± 0.017 | 5.549 ± 0.582 | <0.0001 |
| TransMatch [56] | 0.819 ± 0.018 | 4.576 ± 0.565 | 0.064 ± 0.028 |
| RegMamba | **0.839 ± 0.014** | **4.324 ± 0.505** | 0.055 ± 0.022 |
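Table 2's two accuracy metrics are straightforward to compute per anatomical label. Below is a hedged sketch assuming boolean NumPy masks: Dice [52] as 2|A∩B|/(|A|+|B|), and HD95 under one common surface-distance definition built on SciPy distance transforms. Implementations differ in details such as surface extraction and spacing handling, so treat this as illustrative rather than the paper's exact evaluation code.

```python
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt

def dice(a, b):
    """Dice overlap [52] between two boolean masks."""
    inter = np.logical_and(a, b).sum()
    return 2.0 * inter / (a.sum() + b.sum() + 1e-8)

def hd95(a, b):
    """95th-percentile symmetric Hausdorff distance between mask surfaces,
    in voxel units (scale by voxel spacing to obtain millimeters)."""
    surf_a = a & ~binary_erosion(a)              # one-voxel-thick surface of a
    surf_b = b & ~binary_erosion(b)
    dist_to_b = distance_transform_edt(~surf_b)  # distance map to b's surface
    dist_to_a = distance_transform_edt(~surf_a)
    d = np.concatenate([dist_to_b[surf_a], dist_to_a[surf_b]])
    return np.percentile(d, 95)
```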
Table 3. Ablation study 1. The arrow indicates that higher values are better. C denotes the initial number of channels; Y/N indicates whether the corresponding component is included.
| Method | Model | C | DCRM | Mamba | Scale Conv | Dice ↑ |
|---|---|---|---|---|---|---|
| A0 | U-Net | 8 | - | - | - | 76.16 (4.08) |
| A1 | RegMamba | 8 | Y | Y | N | 76.92 (3.84) |
| A2 | RegMamba | 8 | Y | N | N | 76.99 (3.82) |
| A3 | RegMamba | 8 | N | Y | Y | 76.95 (3.87) |
| A4 | RegMamba | 8 | Y | Y | Y | 77.06 (3.73) |
| B1 | RegMamba | 16 | Y | Y | Y | 77.35 (3.71) |
| B2 | RegMamba | 32 | Y | Y | Y | 77.46 (3.70) |
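Table 3 isolates the contribution of the DCRM. For readers unfamiliar with deformable convolutions [48], the sketch below shows the general offset-prediction-plus-residual pattern in 2D using torchvision's DeformConv2d. The paper's module operates on 3D volumes, and its exact design (offset branch, normalization, activation) is not reproduced here, so every detail in this block is an illustrative assumption.

```python
# Illustrative 2D analogue of a deformable convolutional residual block;
# torchvision ships only a 2D deformable convolution, and the paper's DCRM
# is 3D and may differ in structure.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableResBlock2d(nn.Module):
    def __init__(self, channels, k=3):
        super().__init__()
        # A plain conv predicts 2 sampling offsets (dy, dx) per kernel tap.
        self.offset = nn.Conv2d(channels, 2 * k * k, kernel_size=k, padding=k // 2)
        self.deform = DeformConv2d(channels, channels, kernel_size=k, padding=k // 2)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x):
        # The deformable conv samples at learned positions, then a residual add
        # preserves the identity path.
        return self.act(x + self.deform(x, self.offset(x)))

# Usage: y = DeformableResBlock2d(8)(torch.randn(1, 8, 64, 64))
```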
Table 4. Ablation study 2. The arrow indicates that higher values are better. SCMB denotes the scaled convolutional Mamba block shown in Figure 1d.
| Method | Model | C | DCRM | SCMB | Dice ↑ |
|---|---|---|---|---|---|
| A0 | U-Net | 8 | - | - | 72.6 (1.23) |
| A1 | RegMamba | 8 | Y | Y | 76.7 (1.24) |
| A2 | RegMamba | 8 | Y | N | 75.3 (1.23) |
| A3 | RegMamba | 8 | N | Y | 76.5 (1.24) |
