Article

RFTNet: Region–Attention Fusion Network Combined with Dual-Branch Vision Transformer for Multimodal Brain Tumor Image Segmentation

Chunxia Jiao, Tiejun Yang, Yanghui Yan and Aolin Yang
1 School of Information Science and Engineering, Henan University of Technology, Zhengzhou 450001, China
2 School of Artificial Intelligence and Big Data, Henan University of Technology, Zhengzhou 450001, China
3 Key Laboratory of Grain Information Processing and Control (HAUT), Ministry of Education, Zhengzhou 450001, China
4 Henan Key Laboratory of Grain Photoelectric Detection and Control (HAUT), Zhengzhou 450001, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(1), 77; https://doi.org/10.3390/electronics13010077
Submission received: 23 November 2023 / Revised: 20 December 2023 / Accepted: 21 December 2023 / Published: 23 December 2023
(This article belongs to the Section Artificial Intelligence)

Abstract

Brain tumor image segmentation plays a significant auxiliary role in clinical diagnosis. Recently, deep learning has been introduced into multimodal segmentation tasks, where various Convolutional Neural Network (CNN) structures have been constructed to achieve excellent performance. However, most CNN-based segmentation methods have a limited capability for global feature extraction. The Transformer is good at modeling long-distance dependencies, but it can cause local information loss and usually has a high computational complexity. In addition, it is difficult to fully exploit the brain tumor features of different modalities. To address these issues, in this paper, we propose a region–attention fusion (RAF) network combined with a dual-branch vision Transformer (DVT), called RFTNet. In RFTNet, the DVT captures delicate local information and global semantics separately through two branches. Meanwhile, a novel RAF module is employed to effectively fuse the images of the different modalities. Finally, we design a new hybrid loss function, called the region-mixed loss function (RML), to weigh the importance of each pixel and address the problem of class imbalance. Experiments on the BraTS2018 and BraTS2020 datasets show that our method obtains higher segmentation accuracy than other models. Furthermore, ablation experiments prove the effectiveness of each key component in RFTNet.

1. Introduction

Brain tumors are among the most dangerous brain diseases, with high mortality and morbidity rates [1]. Gliomas are common primary brain tumors and are typically treated by surgical resection. Automated brain tumor image segmentation can help physicians locate lesions quickly and make correct judgments, which is of great clinical importance. Because of its good soft tissue resolution, magnetic resonance (MR) imaging is commonly used in the diagnosis of brain tumors [2]. There are four main modalities of brain MR images: T1, T2, T1ce, and Flair. In the T1 modality, the anatomical structure is prominent. The lesion area is more visible in the T2 modality. The T1ce modality can be used to observe tumor margins and distinguish tumors from non-tumorous lesions. Compared with the T2 modality, the Flair modality suppresses the high signal of cerebrospinal fluid, so lesions near the cerebrospinal fluid can be displayed [3]. As a result, the Flair modality can effectively show the surroundings of the tumor and present the edema region. Learning from only one modality is therefore not sufficient when segmenting brain tumor images. Multimodal image information is complementary, which helps improve segmentation accuracy [4]. Therefore, most automated brain tumor segmentation tasks are based on multimodal images.
The complexity of the brain structure makes multimodal automatic segmentation a difficult task. Firstly, the brain tumor region includes healthy tissue, regions of edema, and the tumor core [5]. It is hard to distinguish the relevant tumor regions due to the high similarity between them. Moreover, the boundaries between some lesion regions and soft tissues are blurred, which makes the accurate identification of tumor contours difficult. In addition, there is a class imbalance in the brain tumor segmentation task, which makes the training process difficult. It is therefore essential to find an effective segmentation method for multimodal brain tumor MR images.
Many researchers have proposed effective methods for brain tumor segmentation. In recent years, image segmentation methods based on the convolutional neural network (CNN) have been increasingly used. They convert pixel-label prediction into a classification problem over local image blocks, which can effectively capture the details of the image. U-Net [6] provides a good architecture for CNN-based image segmentation methods: its encoding–decoding architecture adequately exploits the local information at shallow layers and the semantic information at deep layers. However, CNNs still have some limitations. The pooling layer may lose important information, and the correlation between local and global features is ignored. The inherent limitations of convolution kernels also make it hard to learn global semantic information. Recently, Dosovitskiy et al. [7] proposed the Vision Transformer (ViT). The self-attention mechanism of the ViT is not limited to local interactions; it can obtain global information by modeling long-distance dependencies. Nevertheless, the ViT has a higher computational complexity and a poor ability to capture local features. To effectively fuse brain tumor features of different modalities, many researchers have also proposed multimodal feature fusion strategies. In general, there are three image fusion architectures: pixel-level, feature-level, and decision-level. Pixel-level fusion fuses information at the data layer, but it requires high registration accuracy of the source images and has a heavy workload. Feature-level fusion first extracts the features of the input images and then fuses these feature vectors according to certain rules. It reduces the workload but loses more details than pixel-level fusion, which can be compensated for by improving the feature extraction capability. Decision-level fusion feeds the images of different modalities into the segmentation network individually and then integrates the outputs to obtain an overall decision. This approach can learn complementary features from different modalities but requires more memory. We therefore adopt feature-level fusion to fuse the multimodal image features.
In this paper, we focus on the segmentation task for multimodal brain tumor MR images. We propose RFTNet. The main contributions of our work are the following:
  • To comprehensively capture delicate local information and global semantic information, we combine 3D CNNs and the dual-branch vision Transformer (DVT) for feature extraction.
  • We propose a region–attention fusion module (RAF) to fuse the feature information of different modalities. According to the contribution of each modality to the relevant region, the RAF aggregates more useful multimodal features to improve the segmentation accuracy.
  • To solve the problem of class imbalance and consider the valuable pixel information, we design a region-mixed loss function (RML) based on the characteristics of brain tumors.
We organize the rest of this paper as follows. Section 2 introduces the research related to this study. Section 3 provides the specific implementation. Section 4 presents the datasets, preprocessing steps, evaluation metrics, and experimental configurations, as well as our comparison and ablation experiments. Section 5 provides the discussion and conclusion.

2. Related Work

We review the related research from three aspects. Firstly, we review the CNNs used in medical image segmentation tasks. Then, we introduce various Transformer-based image segmentation methods. Finally, we outline the fusion strategies used in multimodal image segmentation tasks.

2.1. CNN-Based Segmentation Methods

In early CNN-based segmentation methods, the 2D CNN structure was adopted as the main backbone. 3D CNN [8] models can be obtained by converting 2D convolution kernels into 3D convolution kernels. For voxel segmentation, 3D networks can combine the information between image layers and preserve the full-volume image, which provides better segmentation performance than 2D networks. Therefore, most current research is based on 3D models. With the use of U-Net in medical image segmentation, brain tumor segmentation based on CNNs has achieved good results. To balance local features and global semantics, some studies have been devoted to building deep networks [9]. Other studies have extracted features at multiple scales. Zhang et al. [10] introduced short-circuit residual modules to extract multi-scale information and proposed a mesh aggregation strategy to consider the semantic relationships between adjacent convolution blocks when aggregating features at different levels. Rehman et al. [11,12] added extended skip connections between the encoder and decoder to increase the effective receptive field and obtain multi-scale features. Zhou et al. [13] constructed a 3D transverse convolution feature pyramid at the end of the network. Similarly, Wang et al. [14] used a spatially expanded feature pyramid to exploit multi-scale feature information. To reduce information loss, Rehman et al. [15] used residual spatial pyramid pooling to capture multi-scale features and exploited the attention gate to emphasize the effective features. Other studies have extracted features of different resolutions from multiple paths. To extract multi-scale spatial feature information, Peng et al. [16] introduced multiple 3D U-Net blocks. Zhu et al. [17] exploited two encoders to extract input features at two scales. Based on DenseNet [18] and feature pyramids, Fang et al. [19] designed a dual-path architecture to fuse the feature information at different levels, which makes tumor structural features more diverse. Still other studies have modeled long-distance dependencies through attention mechanisms; for example, non-local neural networks [20] directly model long-distance dependencies by computing the interaction between any two locations.
In the above work, models based on deep networks cause information redundancy, and important local information may be discarded as the layers become deeper. Models based on multi-scale features struggle to balance the features across scales. Multi-path models have low computational efficiency and require a large amount of memory. Models based on self-attention focus on the local–global relationship, but their computational cost and memory consumption are high.

2.2. Vision Transformer-Based Segmentation Methods

The self-attention mechanism in the Transformer makes it good at capturing global dependencies, which is significant for semantic segmentation. Therefore, more and more medical segmentation studies have adopted Transformer-based methods. Zhang et al. [21] proposed a network with a parallel combination of 2D CNNs and a Transformer, and fused the features of the two branches with a designed fusion module. To train the Transformer efficiently, Valanarasu et al. [22] introduced a gated axial attention module into the self-attention mechanism. Xie et al. [23] designed a deformable Transformer in the encoder to build long-distance dependencies over the feature maps extracted by 3D CNNs; the deformable Transformer only attends to a few significant locations, leading to a lower computational complexity. UNETR [24] used the Transformer in the whole encoder to extract 3D brain tumor image features, where every three blocks form a stage, and the features at different resolutions are connected to the decoder by skip connections; in this way, global contextual information is extracted at multiple scales. Similarly, Hatamizadeh et al. [25] proposed Swin UNETR, whose encoder is the Swin Transformer [26] instead of standard convolutions, with each stage consisting of two blocks. Peiris et al. [27] designed a volumetric Transformer whose self-attention mechanism captures both local and global features, while the decoder exploits self-attention and cross-attention to capture delicate information. Li et al. [28] proposed adaptive tokens to model global context information and reduce computational complexity. Zhou et al. [29] introduced a combination of self-attention mechanisms and interleaved convolutions and exploited local and global self-attention mechanisms to learn spatial features. Lee et al. [30] set tokens of different sizes and fed them into the Transformer through multiple paths; with this method, multi-scale features can be obtained at the same feature level.
Although ViT-based image segmentation methods have achieved good performance in different medical tasks, they still need improvement. Firstly, in some models, the encoder or even the entire encoder–decoder structure is built solely from Transformer blocks. This improves the capability of global feature extraction but reduces the ability to capture details. Moreover, because the Transformer does not have a strong inductive bias like convolution operations, it often performs worse than CNNs when training data are scarce. To address these issues, other models have combined CNNs and the Transformer to exploit the advantages of both while reducing the network complexity. However, their Transformer structures are difficult to adapt to the local features extracted by shallow CNNs, and they do not consider the relationship between the feature representations obtained from the Transformer and from the CNNs. Different from previous work, we combine the DVT and 3D CNNs to extract image features. Our DVT is designed with two branches that focus on local information and global semantics, respectively, while also completing the interaction between them.

2.3. Multimodal Feature Fusion

Feature fusion can effectively exploit the characteristics of different modalities to make the segmentation results more accurate, and a lot of work has been devoted to multimodal image fusion methods. Dolz et al. [31] adopted the idea of dense connections; their network can learn information across different modalities and different levels. Syazwany et al. [32] used a bi-directional feature pyramid network to fuse multimodal as well as multi-scale features. Zhou et al. [33] exploited spatial, channel, and correlation attention mechanisms to guide the network to learn potentially relevant features. Similarly, Liu et al. [34] introduced channel and spatial dual attention mechanisms to fuse 3D multimodal information and designed specific loss functions for the different modality features. Veshki et al. [35] divided the images into relevant and irrelevant components; the maximum absolute value rule was adopted for the relevant components, while the irrelevant components were retained to preserve the modality-specific information. Recently, several studies have exploited the Transformer for image fusion. Zhang et al. [36] designed a cross-modal Transformer to build long-distance dependencies between different modalities. Ma et al. [37] introduced the Swin Transformer to fuse the images of different modalities, which achieves global interaction of complementary information. Tang et al. [38] used an adaptive Transformer and convolutions to complete multimodal medical image fusion.
There are still some challenges in fusing multimodal image features effectively. Fusion models based on multi-scale and attention mechanisms ignore the correlation between multimodal images, while Transformer-based fusion models lack spatial information and have a high complexity. In this paper, we focus on the sensitivity of each modality to the relevant regions and extract the most important modality features in each region. Based on this, we design a region-based fusion module, RAF, which can better exploit the characteristics of the different modalities.

3. Method

In this section, we first introduce the overall architecture of the network. Then, the designed dual-branch vision Transformer (DVT), region–attention fusion module (RAF), and region-mixed loss function (RML) are presented, respectively.

3.1. Network Architecture

We outline the overall architecture of our method in Figure 1. RFTNet uses 3D CNNs and the dual-branch vision Transformer to extract image features, with four encoders used for the four brain MR modalities, respectively. Given an input image $X \in \mathbb{R}^{C \times H \times W \times Z}$, where $C$ is the number of channels, $H \times W$ is the spatial resolution, and $Z$ is the depth, the network first extracts the volume feature space by 3D CNNs. After each down-sampling, the spatial size is halved to $\frac{H}{2} \times \frac{W}{2} \times \frac{Z}{2}$. At the last level of the convolution encoder, the network encodes the information into a feature representation $F \in \mathbb{R}^{\frac{H}{8} \times \frac{W}{8} \times \frac{Z}{8}}$ through a $3 \times 3 \times 3$ down-sampling convolution block. Then, the network feeds $F$ into the DVT to model the long-distance dependencies. At each level of the encoder, the network aggregates the shallow location information and the deep semantic information by skip connections and fuses the features from the different modalities with our RAF. In the decoder, 3D convolutions are used to complete up-sampling.
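To make the shape flow described above concrete, the following PyTorch-style sketch traces one modality through three convolutional down-sampling levels before the DVT. The module layout, channel widths, and normalization choices are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ConvDown3D(nn.Module):
    """One convolutional encoder level: a 3x3x3 conv block followed by a
    strided 3x3x3 conv that halves H, W, and Z (channel widths are assumed)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.InstanceNorm3d(out_ch),
            nn.LeakyReLU(inplace=True),
            nn.Conv3d(out_ch, out_ch, kernel_size=3, stride=2, padding=1),  # spatial size /2
        )

    def forward(self, x):
        return self.block(x)

# Shape trace for one modality encoder (hypothetical channel widths 32/64/128).
x = torch.randn(1, 1, 80, 80, 80)        # (B, C, H, W, Z) after cropping
f1 = ConvDown3D(1, 32)(x)                # -> (1, 32, 40, 40, 40), H/2
f2 = ConvDown3D(32, 64)(f1)              # -> (1, 64, 20, 20, 20), H/4
f3 = ConvDown3D(64, 128)(f2)             # -> (1, 128, 10, 10, 10), H/8: fed to the DVT
print(f1.shape, f2.shape, f3.shape)
```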

3.2. Dual-Branch Vision Transformer

As shown in Figure 2, the high-resolution features from the convolution layer are extracted by the dual-branch block and merge block. In the dual-branch block, there are two branches to focus on the finer local information and the global semantics, respectively. For the input feature maps, the dual-branch block first extracts high-level semantic tokens from the global perspective by a deeper semantic branch. Afterward, the semantic branch puts these semantic tokens into the pixel branch in the form of keys/values. In this way, the dependency of internal information on global semantics is strengthened, and the complexity of multi-head self-attention is reduced. To realize the internal interaction between local tokens in the pixel branch, the merge block concatenates the final outputs from two branches, feeds them into the multi-head self-attention layer, and then puts them into different feed-forward layers, respectively.
Formally, for the input feature $x_{l-1}$ ($l \in [1, 5]$) of the $l$-th block, we define the input of the pixel branch as $x_{l-1}$, the input of the semantic branch as $z_{l-1}$ (with $z_{l-1} = x_{l-1}$), the output of the pixel branch as $x_l$, and the output of the semantic branch as $z_l$. The semantic branch is expressed as follows:
$$\hat{z}_l = \mathrm{MHA}\big(L(\mathrm{LN}(z_{l-1})),\, L(\mathrm{LN}(z_{l-1})),\, L(\mathrm{LN}(z_{l-1}))\big) + z_{l-1}$$
$$\tilde{z}_l = \mathrm{MHA}\big(L(\mathrm{LN}(\hat{z}_l)),\, L(\mathrm{LN}(x_{l-1})),\, L(\mathrm{LN}(x_{l-1}))\big) + \hat{z}_l$$
$$z_l = \mathrm{FFN}(\mathrm{LN}(\tilde{z}_l)) + \tilde{z}_l$$
The specific expression for the pixel branch is as follows:
$$\hat{x}_l = \mathrm{MHA}\big(L(\mathrm{LN}(x_{l-1})),\, L(\mathrm{LN}(z_l)),\, L(\mathrm{LN}(z_l))\big) + x_{l-1}$$
$$x_l = \mathrm{FFN}(\mathrm{LN}(\hat{x}_l)) + \hat{x}_l$$
where $\mathrm{LN}$ denotes layer normalization, $L$ denotes a linear transformation, $\mathrm{MHA}$ represents the multi-head attention layer, and $\mathrm{FFN}$ represents the feed-forward layer, which includes two fully connected layers, an activation function, and a convolution layer. The parameters of $\mathrm{MHA}$ are $q$ (query), $k$ (key), and $v$ (value), which are used for self-attention. $\mathrm{MHA}$ is computed as follows:
$$\mathrm{MHA}(q, k, v) = \mathrm{softmax}\!\left(\frac{q k^{T}}{\sqrt{d_k}}\right) v$$
where $d_k$ is the dimension of $k$; the factor $\sqrt{d_k}$ scales down the attention scores.
The merge block takes the outputs of the two branches as input. For the $l$-th block, we denote the inputs from the pixel and semantic branches by $x_l$ and $z_l$, and the corresponding outputs of the merge block by $x'_l$ and $z'_l$. The merge block is expressed as follows:
$$\bar{x}_l, \bar{z}_l = S\big(\mathrm{MHA}(\mathrm{LN}(x_l \,\|\, z_l)) + (x_l \,\|\, z_l)\big)$$
$$x'_l = \mathrm{FFN}(\mathrm{LN}(\bar{x}_l)) + \bar{x}_l$$
$$z'_l = \mathrm{FFN}(\mathrm{LN}(\bar{z}_l)) + \bar{z}_l$$
where $\|$ denotes tensor concatenation, $S$ denotes splitting the concatenated tokens back into the two branches, and $\bar{x}_l$ and $\bar{z}_l$ are the intermediate representations.
We set up four stages in the DVT. The first stage uses two dual-branch blocks with a head number of 2, a feed-forward expansion rate of 8, and a channel dimension of 64. The second stage uses three dual-branch blocks with a head number of 4, an expansion rate of 8, and a channel dimension of 128. The third stage uses five merge blocks with a head number of 6, an expansion rate of 4, and a channel dimension of 192. The fourth stage uses three merge blocks with a head number of 4, an expansion rate of 3, and a channel dimension of 128. After the DVT, the size of the feature map is reduced to $\frac{H}{16} \times \frac{W}{16} \times \frac{Z}{16}$. With the combination of 3D CNNs and the DVT, our network can effectively extract image feature information, which is important for the entire segmentation task.
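As a concrete illustration of the equations above, the following PyTorch sketch implements one dual-branch block: the semantic branch first self-attends, then queries the pixel tokens, and the pixel branch attends to the updated semantic tokens as keys/values. Token counts, the simplified FFN, and the hyper-parameters are our assumptions for illustration only, not the authors' code.

```python
import torch
import torch.nn as nn

class DualBranchBlock(nn.Module):
    """Minimal sketch of one DVT dual-branch block. x: pixel tokens (B, Np, C);
    z: semantic tokens (B, Ns, C). The linear maps L(.) are folded into the
    attention layers' internal projections, and the FFN is simplified to two
    linear layers (the paper also includes a convolution)."""
    def __init__(self, dim=64, heads=2, ffn_ratio=8):
        super().__init__()
        self.n = nn.ModuleList(nn.LayerNorm(dim) for _ in range(7))
        self.attn_zz = nn.MultiheadAttention(dim, heads, batch_first=True)  # semantic self-attention
        self.attn_zx = nn.MultiheadAttention(dim, heads, batch_first=True)  # semantic branch queries pixel tokens
        self.attn_xz = nn.MultiheadAttention(dim, heads, batch_first=True)  # pixel branch queries semantic tokens
        def ffn():
            return nn.Sequential(nn.Linear(dim, dim * ffn_ratio), nn.GELU(),
                                 nn.Linear(dim * ffn_ratio, dim))
        self.ffn_z, self.ffn_x = ffn(), ffn()

    def forward(self, x, z):
        # Semantic branch: self-attention, cross-attention to pixel tokens, FFN.
        z = self.attn_zz(self.n[0](z), self.n[0](z), self.n[0](z))[0] + z
        z = self.attn_zx(self.n[1](z), self.n[2](x), self.n[2](x))[0] + z
        z = self.ffn_z(self.n[3](z)) + z
        # Pixel branch: attends to the updated semantic tokens as keys/values.
        x = self.attn_xz(self.n[4](x), self.n[5](z), self.n[5](z))[0] + x
        x = self.ffn_x(self.n[6](x)) + x
        return x, z

# Hypothetical usage: 1000 pixel tokens and 64 semantic tokens of width 64.
block = DualBranchBlock(dim=64, heads=2, ffn_ratio=8)
x, z = torch.randn(1, 1000, 64), torch.randn(1, 64, 64)
x, z = block(x, z)
print(x.shape, z.shape)  # torch.Size([1, 1000, 64]) torch.Size([1, 64, 64])
```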

3.3. Multimodal Region–Attention Fusion Module

As shown in Figure 3, we propose the region–attention fusion module (RAF) to combine the most important features of the different modalities. RAF first divides the images of the different modalities into different target regions, which is the most important step in the whole fusion process. Specifically, the target regions are the background region (BG), the edema region (ED), the enhancing tumor region (ET), and the non-enhancing tumor/necrotic region (NET/NT). Under the supervision of the ground truth, channel and spatial attention mechanisms are used to precisely divide these regions: channel attention focuses on the channels that are important in the input feature map, while spatial attention focuses on the useful pixel regions. Introducing attention in these two dimensions helps improve the learning accuracy of the region probabilities, which is important for the subsequent region division task. Then, the output of the dual-attention module is passed to the region fusion module, which concatenates it with the features from the previous decoder layer (except at the last layer). After that, the probabilities of the target regions are learned under the supervision of the ground truth, and the input feature maps are divided into different regions by element-wise multiplication with the region probabilities.
The specific implementation of region division is mathematically defined as the following:
$$f'_{i,j} = M_c(f_{i,j}) \otimes f_{i,j}$$
$$f''_{i,j} = M_s(f'_{i,j}) \otimes f'_{i,j}$$
$$\hat{y}_{i,j}^{fp} = \frac{\exp\!\big(\Phi_j(f''_{i,j}; \theta_j)\big)}{\sum_{r \in R} \exp\!\big(\Phi_j(f''_{i,j}; \theta_j)_r\big)}$$
$$f_r = f''_{i,j} \cdot \hat{y}_r^{fp}$$
where $f_{i,j}$ represents the features from the current encoder layer and the previous decoder layer (except at the last decoder layer), $i$ denotes the $i$-th subject, and $j$ denotes the $j$-th layer. $\otimes$ means element-wise multiplication. $M_c$ denotes the channel attention map, and $f'_{i,j}$ denotes the channel attention feature representation. $M_s$ represents the spatial attention map, and $f''_{i,j}$ represents the feature representation after both the channel and spatial attention mechanisms. $\theta_j$ denotes the relevant parameters, and $\Phi_j$ denotes the region division at layer $j$. $R$ denotes the set of regions to be divided, including BG, ED, ET, and NET/NT. $\hat{y}_{i,j}^{fp}$ is the probability map of the region features, and $\hat{y}_r^{fp}$ indicates the probability of the features in region $r$. $\cdot$ denotes multiplication. $f_r$ represents the feature division of region $r$.
After dividing the images into the corresponding target regions, the multimodal feature information is adaptively fused by learning the importance of the different modalities in each region. Specifically, average pooling is first applied to obtain the global feature of region $r$ from $f_r$. After this average-pooling regularization, the result is divided by the probability map $\hat{y}_{i,j}^{fp}$ and then concatenated with $\hat{y}_{i,j}^{fp}$. After two fully connected layers, a Leaky ReLU, and a sigmoid activation function, the modal attention weights of region $r$ are obtained. These weights are multiplied with the modal features to adjust the contributions of the different modalities, so that the modality most sensitive to region $r$ dominates the region feature fusion. With this method, the network can exploit the characteristics of the different modal images and combine their strengths to obtain a more effective feature representation. Finally, the fused features of all regions are concatenated to generate the final multimodal fusion features. This can be mathematically defined as follows:
$$f = \mathop{\big\Vert}_{r \in R} \sum_{m \in \Omega} f_r^m \cdot w_r$$
where $f_r^m$ represents the feature representation of modality $m$ in region $r$, $w_r$ denotes the modal attention weights in region $r$, and $f$ represents the fused feature representation output by the RAF.
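The following PyTorch sketch illustrates the region-weighted fusion step described above for a single skip-connection level: the region probabilities mask each modality's features, learned per-region modality weights rescale them, and the fused region features are concatenated. All module names, channel sizes, and the exact weighting layers are our assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class RegionAttentionFusion(nn.Module):
    """Sketch of the RAF fusion step: mask each modality's features with the
    region probability, score each modality per region, and concatenate the
    fused region features. Channel sizes and layer shapes are assumptions."""
    def __init__(self, ch=32, num_regions=4):
        super().__init__()
        self.num_regions = num_regions
        # Two FC layers + LeakyReLU + sigmoid -> one weight per modality in a region.
        self.score = nn.Sequential(
            nn.Linear(ch + 1, ch), nn.LeakyReLU(inplace=True),
            nn.Linear(ch, 1), nn.Sigmoid())

    def forward(self, feats, region_prob):
        # feats: list of M tensors, each (B, C, H, W, Z); region_prob: (B, R, H, W, Z)
        fused = []
        for r in range(self.num_regions):
            p_r = region_prob[:, r:r + 1]                            # (B, 1, H, W, Z)
            masked = [f * p_r for f in feats]                        # f_r^m for each modality
            p_mean = p_r.mean(dim=(2, 3, 4))                         # pooled region probability, (B, 1)
            # Modality weight from the pooled region feature and region probability.
            weights = [self.score(torch.cat([m.mean(dim=(2, 3, 4)), p_mean], dim=1))
                       for m in masked]                              # each (B, 1)
            fused_r = sum(w.view(-1, 1, 1, 1, 1) * m for w, m in zip(weights, masked))
            fused.append(fused_r)
        return torch.cat(fused, dim=1)                               # concatenate regions along channels

# Hypothetical usage: 4 modalities, 4 regions, 32 channels, a 10^3 feature volume.
raf = RegionAttentionFusion(ch=32)
feats = [torch.randn(1, 32, 10, 10, 10) for _ in range(4)]
probs = torch.softmax(torch.randn(1, 4, 10, 10, 10), dim=1)
print(raf(feats, probs).shape)  # torch.Size([1, 128, 10, 10, 10])
```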

3.4. Region-Mixed Loss Function

Because brain tumors occupy only a small portion of the whole brain, class imbalance often occurs during segmentation. Moreover, the boundaries of the relevant tumor regions are very blurred, which makes it difficult to distinguish the different regions. To solve these problems, we design a new hybrid loss function that combines the Dice loss [39], the weighted cross-entropy loss [40], and the region-wise loss [41]. The specific representation is as follows:
$$L = L_{DL} + L_{WCE} + L_{RW}$$
where $L_{DL}$ denotes the Dice loss, $L_{WCE}$ represents the weighted cross-entropy loss, and $L_{RW}$ indicates the region-wise loss. Let the predicted value be $\hat{y}$ and the ground truth be $y$. The three functions are described as follows:
$$L_{DL}(\hat{y}, y) = 1 - \sum_{r \in R} \frac{2 \cdot L_1(\hat{y}_r \cap y_r)}{R_{num} \cdot \big(L_1(\hat{y}_r) + L_1(y_r)\big)}$$
where $\cap$ means the overlap between $\hat{y}$ and $y$, $L_1$ is the $L_1$ regularization, and $R_{num}$ denotes the number of regions in $R$. For small targets like brain tumors, the Dice loss can appropriately address the problem of unbalanced pixel counts.
$$L_{WCE}(\hat{y}, y) = -\frac{\sum_{r \in R} L_1\big(\alpha_r \cdot y_r \cdot \log(\hat{y}_r)\big)}{H \cdot W \cdot Z}$$
where $\alpha_r$ denotes the weight of region $r$. The cross-entropy loss can avoid the vanishing-gradient problem to some extent and effectively prevents over-fitting, and the weighted cross-entropy loss addresses the class imbalance problem.
$$z_i^k = \begin{cases} \dfrac{\left\| i - b_i^k \right\|_2}{\max_{j \in \Omega_k} \left\| j - b_j^k \right\|_2}, & \text{if } i \in \Omega_k \\[2ex] 1, & \text{otherwise} \end{cases}$$
$$L_{RW}(\hat{y}) = \sum_{k=1}^{4} \left( \sum_{i=1}^{n} \hat{y}_i^{k\,T} \cdot z_i^k \right)$$
where $k$ indexes the four classes, i.e., the four regions that need to be segmented, $\Omega_k$ is the set of voxels belonging to class $k$, and $b_i^k$ ($b_j^k$) is the ground-truth boundary voxel nearest to $i$ ($j$) in class $k$. $\|\cdot\|_2$ denotes the Euclidean distance. $z_i^k$ is the region-wise map of voxel $i$ in class $k$; it indicates whether voxel $i$ is closer to the center or to the boundary of the target region. $\hat{y}_i^{k\,T}$ is the transpose of the prediction for voxel $i$ in class $k$. In each class, $L_{RW}(\hat{y})$ penalizes each voxel according to its distance to the boundary. Therefore, it accounts for both class imbalance and pixel importance.
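A minimal PyTorch sketch of the hybrid loss is given below, assuming one-hot ground truth and softmax predictions over the four regions. The region-wise map `z_map` is assumed to be precomputed from the boundary-distance definition above (e.g., via a distance transform of the labels), and the region weights are placeholders, not the authors' values.

```python
import torch
import torch.nn.functional as F

def region_mixed_loss(pred, target, z_map, alpha, eps=1e-5):
    """Sketch of the RML: Dice + weighted cross-entropy + region-wise term.
    pred:   (B, R, H, W, Z) softmax probabilities over the R regions
    target: (B, R, H, W, Z) one-hot ground truth
    z_map:  (B, R, H, W, Z) precomputed region-wise (boundary-distance) map
    alpha:  (R,) per-region weights for the cross-entropy term (placeholders)
    """
    dims = (0, 2, 3, 4)                     # sum over batch and spatial axes
    # Dice loss averaged over regions.
    inter = (pred * target).sum(dim=dims)
    dice = 1.0 - (2.0 * inter / (pred.sum(dim=dims) + target.sum(dim=dims) + eps)).mean()
    # Weighted cross-entropy, normalized by the number of voxels.
    n_vox = pred[:, 0].numel()
    wce = -(alpha.view(1, -1, 1, 1, 1) * target * torch.log(pred + eps)).sum() / n_vox
    # Region-wise term: predictions weighted by the boundary-distance map
    # (averaged per voxel here for scale; the formula above sums over voxels).
    rw = (pred * z_map).sum() / n_vox
    return dice + wce + rw

# Hypothetical usage on a toy 16^3 volume with 4 regions.
pred = torch.softmax(torch.randn(1, 4, 16, 16, 16), dim=1)
labels = torch.randint(0, 4, (1, 16, 16, 16))
target = F.one_hot(labels, 4).permute(0, 4, 1, 2, 3).float()
z_map = torch.rand(1, 4, 16, 16, 16)
alpha = torch.tensor([0.1, 1.0, 1.0, 1.0])  # placeholder region weights
print(region_mixed_loss(pred, target, z_map, alpha))
```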

4. Experimental Results

In this section, a detailed description of the experiments is provided. Firstly, we introduce the datasets, preprocessing steps, evaluation metrics, and implementation details. Secondly, we compare our method with several related methods proposed in recent years to verify its performance. Finally, we demonstrate the effectiveness of the proposed modules through ablation experiments.

4.1. Datasets and Preprocessing

We used the Brain Tumor Segmentation Challenge (BraTS) datasets organized by MICCAI in 2018 and 2020, abbreviated as the BraTS2018 and BraTS2020 datasets. The BraTS2018 dataset contains 285 cases in the training set and 66 cases in the validation set; the BraTS2020 dataset contains 369 cases in the training set and 125 cases in the validation set. For both datasets, each case in the training sets contains four modalities (Flair, T1ce, T1, and T2) and voxel-level ground-truth labels annotated by physicians. The labels are background (BG, label 0), non-enhancing tumor/necrotic region (NET/NT, label 1), edema region (ED, label 2), and enhancing tumor region (ET, label 4). Labels 1 and 4 make up the tumor core (TC) region, and labels 1, 2, and 4 make up the whole tumor (WT) region. The validation sets have no labels, and the test sets are not publicly available. The size of each original image is 240 × 240 × 155.
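For clarity, the short example below converts a BraTS label volume into the three nested evaluation regions (WT, TC, ET) exactly as defined in the dataset description; the function name and NumPy handling are our own.

```python
import numpy as np

def labels_to_regions(seg):
    """Map a BraTS label volume (values 0, 1, 2, 4) to binary masks of the
    three evaluation regions described above."""
    wt = np.isin(seg, [1, 2, 4])   # whole tumor: NET/NT + ED + ET
    tc = np.isin(seg, [1, 4])      # tumor core: NET/NT + ET
    et = (seg == 4)                # enhancing tumor
    return wt, tc, et

# Toy example on a random label volume.
seg = np.random.choice([0, 1, 2, 4], size=(8, 8, 8))
wt, tc, et = labels_to_regions(seg)
print(wt.sum(), tc.sum(), et.sum())
```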
During image preprocessing, the data provider registers the original images, resamples them to an isotropic resolution of 1 mm³, and strips the skull. Following Dorent et al. [42], we remove the excess black background regions and normalize all the MR images to zero mean and unit variance within the brain regions. Before training, the images are randomly cropped to 80 × 80 × 80. Mirror flipping, random rotation, and intensity shifts are used to increase the diversity of the samples.
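The sketch below shows the z-score normalization over brain voxels and a random 80³ crop as described; the helper names, the non-zero brain mask, and the use of NumPy are our assumptions.

```python
import numpy as np

def normalize_brain(volume):
    """Zero-mean, unit-variance normalization computed only over brain voxels
    (assumed to be the non-zero voxels after skull stripping)."""
    brain = volume > 0
    mean, std = volume[brain].mean(), volume[brain].std() + 1e-8
    out = np.zeros_like(volume, dtype=np.float32)
    out[brain] = (volume[brain] - mean) / std
    return out

def random_crop(volume, size=80):
    """Random cubic crop of the given size from a (H, W, Z) volume."""
    h, w, z = volume.shape
    sh, sw, sz = (np.random.randint(0, d - size + 1) for d in (h, w, z))
    return volume[sh:sh + size, sw:sw + size, sz:sz + size]

vol = np.random.rand(160, 160, 120).astype(np.float32)   # toy brain-cropped volume
patch = random_crop(normalize_brain(vol), size=80)
print(patch.shape, patch.mean())
```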

4.2. Evaluation Metrics

We use the Dice similarity coefficient and Hausdorff distance (95%) to evaluate our method. The Dice similarity coefficient can be used to assess the overlap between our predicted results and the ground truth. Its value range is from 0 to 1. It can be described as follows:
$$\mathrm{Dice}(\hat{y}_k, y_k) = \frac{2 \cdot L_1(\hat{y}_k \cap y_k)}{L_1(\hat{y}_k) + L_1(y_k)}$$
where $\hat{y}_k$ denotes the predicted value of tumor class $k$, $y_k$ is the ground truth of tumor class $k$, and $\mathrm{Dice}(\hat{y}_k, y_k)$ represents the Dice score of tumor class $k$. Class $k$ includes WT, TC, and ET. The closer the value is to 1, the higher the similarity between the two sets.
The Hausdorff distance (95%) describes the degree of similarity between the ground truth and our predicted results; taking the 95th percentile of the Hausdorff distance avoids the effect of a small number of outliers on the metric. It can be mathematically described as follows:
$$\mathrm{HD}(\hat{y}_k, y_k) = \max\!\left\{ \max_{\bar{\hat{y}}_k \in \hat{y}_k} \min_{\bar{y}_k \in y_k} d(\bar{\hat{y}}_k, \bar{y}_k),\; \max_{\bar{y}_k \in y_k} \min_{\bar{\hat{y}}_k \in \hat{y}_k} d(\bar{y}_k, \bar{\hat{y}}_k) \right\}$$
where $\bar{\hat{y}}_k$ denotes the elements in $\hat{y}_k$, $\bar{y}_k$ denotes the elements in $y_k$, and $d$ represents the Euclidean distance. $\mathrm{HD}(\hat{y}_k, y_k)$ represents the Hausdorff distance of tumor class $k$. The smaller the Hausdorff distance, the closer the two sets are, indicating a more accurate segmentation result.
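As an illustration of the two metrics, the sketch below computes the Dice score and a percentile Hausdorff distance for binary masks; the brute-force distance computation is our own simplified implementation, not the official BraTS evaluation code.

```python
import numpy as np

def dice_score(pred, gt, eps=1e-8):
    """Dice coefficient between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + eps)

def hausdorff95(pred, gt):
    """95th-percentile symmetric Hausdorff distance between two binary masks
    (brute force over foreground voxels; suitable for small toy volumes only)."""
    p = np.argwhere(pred)
    g = np.argwhere(gt)
    d = np.linalg.norm(p[:, None, :] - g[None, :, :], axis=-1)  # pairwise distances
    d_pg = d.min(axis=1)   # each predicted voxel to its nearest ground-truth voxel
    d_gp = d.min(axis=0)   # each ground-truth voxel to its nearest predicted voxel
    return max(np.percentile(d_pg, 95), np.percentile(d_gp, 95))

pred = np.zeros((16, 16, 16), bool); pred[4:10, 4:10, 4:10] = True
gt = np.zeros((16, 16, 16), bool); gt[5:11, 5:11, 5:11] = True
print(dice_score(pred, gt), hausdorff95(pred, gt))
```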

4.3. Implementation Details

Our network is implemented with the PyTorch framework in Python. The experiments are run on a single NVIDIA RTX 3080 GPU (12 GB) and an Intel i7-10700 CPU at 3.20 GHz. We train our model separately on the two training sets and use the two validation sets to compare our method with other advanced methods; the predicted results are uploaded to the official website of the challenge for online validation. For the ablation experiments, we randomly selected 220, 49, and 100 cases from the BraTS2020 training set to train, validate, and test our model, respectively. During training, Adam [43] is used to optimize the network with a weight decay of $1 \times 10^{-5}$. The learning rate follows a polynomial schedule, $2 \times 10^{-4} \times \left(1 - \frac{epoch}{max\_epoch}\right)^{0.9}$; the number of epochs is 300 and the batch size is 1.
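The optimizer and polynomial learning-rate decay described above can be sketched as follows; the model placeholder and the per-epoch update location are assumptions.

```python
import torch

model = torch.nn.Conv3d(4, 4, kernel_size=3, padding=1)   # placeholder standing in for RFTNet
base_lr, max_epoch = 2e-4, 300
optimizer = torch.optim.Adam(model.parameters(), lr=base_lr, weight_decay=1e-5)

def poly_lr(epoch):
    """Polynomial decay: lr = base_lr * (1 - epoch / max_epoch) ** 0.9."""
    return base_lr * (1.0 - epoch / max_epoch) ** 0.9

for epoch in range(max_epoch):
    for group in optimizer.param_groups:
        group["lr"] = poly_lr(epoch)
    # ... one training epoch with batch size 1 over 80^3 crops would go here ...

print(poly_lr(0), poly_lr(150), poly_lr(299))
```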

4.4. Comparison with Other Methods

To verify the effectiveness of the proposed segmentation method, we compared our model with seven advanced methods. Among them, 3D U-Net [44] and V-Net [45] are common basic architectures, while UNet++ [46], DMFNet [47], and Akbar et al. [48] are improved architectures based on U-Net that can extract feature information more effectively. To prove the superiority of our DVT, we also compared our model with TransUNet [49], which is based on the ViT. To verify the superiority of our RAF, we compared our model with mmFormer [36], which fuses the features of different modalities with a Transformer. Table 1 and Table 2 present the average Dice scores and Hausdorff distances of these methods and ours on the BraTS2018 and BraTS2020 validation sets, respectively. The results are obtained from online validation on the official website. TC, ET, and WT denote the tumor core, enhancing tumor, and whole tumor, respectively. The bold font represents the best scores.
From Table 1, it can be seen that on the BraTS2018 validation set, we achieve Dice scores of 90.30%, 82.15%, and 80.24% on WT, TC, and ET, respectively, with Hausdorff distances of 5.968 mm, 6.411 mm, and 3.158 mm. From Table 2, on the BraTS2020 validation set, the Dice scores on the three regions are 88.97%, 82.41%, and 78.95%, with Hausdorff distances of 9.465 mm, 7.289 mm, and 4.837 mm. Compared with 3D U-Net [44], UNet++ [46], V-Net [45], DMFNet [47], and Akbar et al. [48], we obtain better scores on the two datasets, which indicates the advantage of the long-distance dependencies modeled by our DVT. Compared with TransUNet [49] and mmFormer [36], we also obtain better results, especially on TC and ET, which shows the effectiveness of our CNN–DVT structure. Our RAF also effectively fuses the features of the different modalities, which makes our results more competitive, and our RML further improves the results. There are some fluctuations between Table 1 and Table 2 for our method and the other methods; the reason is that there are significant differences between the BraTS2018 and BraTS2020 datasets, which can lead to some differences in the segmentation results. In general, we achieve better results on the two datasets, which demonstrates the effectiveness of our method for the multimodal brain tumor image segmentation task.

4.5. Ablation Study

To evaluate the contribution of the different modules to the segmentation task, we conducted detailed ablation studies. Table 3 presents the average Dice scores and Hausdorff distances on the BraTS2020 training dataset, using the train/validation/test split described in Section 4.3. M0 to M7 represent models with different combinations of the modules (DVT, RAF, and RML). M0 is the baseline model, and M7 is the full model proposed in this study. The bold font represents the best scores, and √ represents the introduction of the corresponding module.
The baseline uses 3 × 3 × 3 convolutions to extract and aggregate the image features from the four modalities, and its loss function consists of the weighted cross-entropy loss and the Dice loss. On the BraTS2020 dataset, compared with the baseline, the Dice scores of the full model increase by 3.55%, 5.99%, and 3.15% in the three tumor regions, respectively, and the Hausdorff distances decrease by 8.137 mm, 8.736 mm, and 5.814 mm. This shows that the designed modules are effective for the brain tumor segmentation task. With the introduction of the DVT, the results on the evaluation metrics improve, because the DVT can better model long-distance dependencies and fully combine local features with global information. With the RAF module, the TC and ET regions obtain higher Dice scores, which demonstrates that the region–attention fusion module can fully utilize the characteristics of the different modalities, resulting in more accurate classification of the relevant tumor regions. The combination of RML and RAF reduces the Hausdorff distance of the three tumor regions, because RML penalizes each voxel based on its distance to the boundary, which improves the accuracy of the fusion results.
Figure 4 shows the visualization results of our ablation studies on the BraTS2020 dataset. From the upper left to the lower right are the segmentation results of the baseline model, the models with different modules introduced, our full model with all modules, and the ground truth. Red, blue, and green represent the non-enhancing tumor region, the enhancing tumor region, and the edema region, respectively. It can be seen that better segmentation results are achieved after introducing the different modules; in particular, the segmentation of small tumor regions is significantly improved.

5. Discussion and Conclusions

In this paper, we propose an automatic segmentation method for 3D multimodal brain tumor MR images. Firstly, we introduce the encoding structure, which is designed as four independent encoders for the four brain MR modalities. The encoders consist of 3D CNNs and the DVT. In the first three layers, the shallow local information is extracted by CNNs; in the last layer, the DVT is used to capture global semantics. The DVT has two branches, which focus on delicate pixel details and deeper global semantics, respectively, and it builds a dependency relationship between them. Between the encoder and the decoder, the four modalities are fused by the RAF in the skip-connection module. The RAF effectively exploits the characteristics of the different modalities, which helps improve the segmentation accuracy. Finally, we design a novel hybrid loss function that combines the Dice loss, the weighted cross-entropy loss, and the region-wise loss; it considers the importance of pixels and addresses the class imbalance problem. Experiments on the BraTS2018 and BraTS2020 datasets show that the proposed multimodal brain tumor image segmentation method extracts image features more effectively, fully utilizes the features of the different modalities, and alleviates the problem of class imbalance to some extent.
Although we have achieved good results in the segmentation of multimodal brain tumor MR images, missing modalities often occur in clinical practice, which has a large impact on the segmentation task. Therefore, it is crucial to find a segmentation method that can cope with missing modalities. In future work, we will focus on brain tumor segmentation methods that handle missing modalities.

Author Contributions

Methodology, writing—original draft preparation, C.J. and T.Y.; writing—review and editing, C.J. and Y.Y.; resources, funding acquisition, T.Y.; data curation, C.J. and A.Y.; visualization, C.J.; supervision, C.J., T.Y. and Y.Y.; project administration, A.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 62106067, the key specialized research and development program of Henan Province, grant number 202102210170, the Open Fund Project of Key Laboratory of Grain Information Processing & Control, grant number KFJJ2021101, and the Innovative Funds Plan of Henan University of Technology, grant number 2021ZKCJ14.

Data Availability Statement

The datasets are provided by BraTS 2018 Challenge and BraTS 2020 Challenge and are allowed for personal academic research. The specific link to the dataset is https://ipp.cbica.upenn.edu/ (accessed on 15 November 2023).

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of the data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Işın, A.; Direkoğlu, C.; Şah, M. Review of MRI-based brain tumor image segmentation using deep learning methods. In Proceedings of the 12th International Conference on Application of Fuzzy Systems and Soft Computing, ICAFS 2016, Vienna, Austria, 29–30 August 2016; Elsevier B.V.: Amsterdam, The Netherlands, 2016; pp. 317–324. [Google Scholar]
  2. Wadhwa, A.; Bhardwaj, A.; Verma, V.S. A review on brain tumor segmentation of MRI images. Magn. Reson. Imaging 2019, 61, 247–259. [Google Scholar] [CrossRef] [PubMed]
  3. Menze, B.H.; Van Leemput, K.; Lashkari, D.; Weber, M.-A.; Ayache, N.; Golland, P. A generative model for brain tumor segmentation in multi-modal images. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2010, Proceedings of the 13th International Conference, Beijing, China, 20–24 September 2010; Springer: Berlin/Heidelberg, Germany, 2010; pp. 151–159. [Google Scholar]
  4. Zhou, T.; Ruan, S.; Canu, S. A review: Deep learning for medical image segmentation using multi-modality fusion. Array 2019, 3, 100004. [Google Scholar] [CrossRef]
  5. Bauer, S.; Wiest, R.; Nolte, L.-P.; Reyes, M. A survey of MRI-based medical image analysis for brain tumor studies. Phys. Med. Biol 2013, 58, R97. [Google Scholar] [CrossRef] [PubMed]
  6. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  7. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  8. Kamnitsas, K.; Ferrante, E.; Parisot, S.; Ledig, C.; Nori, A.V.; Criminisi, A.; Rueckert, D.; Glocker, B. DeepMedic for brain tumor segmentation. In Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries, Proceedings of the Second International Workshop, Athens, Greece, 17 October 2016; Springer: Cham, Switzerland, 2017; pp. 138–149. [Google Scholar]
  9. Mallat, S. Understanding deep convolutional networks. Philos. Trans. R. Soc. A-Math. Phys. Eng. Sci. 2016, 374, 20150203. [Google Scholar] [CrossRef] [PubMed]
  10. Zhang, Y.; Lu, Y.; Chen, W.; Chang, Y.; Gu, H.; Yu, B. MSMANet: A multi-scale mesh aggregation network for brain tumor segmentation. Appl. Soft. Comput. 2021, 110, 107733. [Google Scholar] [CrossRef]
  11. Rehman, M.U.; Cho, S.; Kim, J.H.; Chong, K.T. Bu-net: Brain tumor segmentation using modified u-net architecture. Electronics 2020, 9, 2203. [Google Scholar] [CrossRef]
  12. Rehman, M.U.; Cho, S.; Kim, J.; Chong, K.T. Brainseg-net: Brain tumor mr image segmentation via enhanced encoder–decoder network. Diagnostics 2021, 11, 169. [Google Scholar] [CrossRef]
  13. Zhou, Z.; He, Z.; Jia, Y. AFPNet: A 3D fully convolutional neural network with atrous-convolution feature pyramid for brain tumor segmentation via MRI images. Neurocomputing 2020, 402, 235–244. [Google Scholar] [CrossRef]
  14. Wang, J.; Gao, J.; Ren, J.; Luan, Z.; Yu, Z.; Zhao, Y.; Zhao, Y. DFP-ResUNet: Convolutional neural network with a dilated convolutional feature pyramid for multimodal brain tumor segmentation. Comput. Meth. Programs Biomed. 2021, 208, 106208. [Google Scholar] [CrossRef]
  15. Rehman, M.U.; Ryu, J.; Nizami, I.F.; Chong, K.T. RAAGR2-Net: A brain tumor segmentation network using parallel processing of multiple spatial frames. Comput. Biol. Med. 2023, 152, 106426. [Google Scholar] [CrossRef] [PubMed]
  16. Peng, S.; Chen, W.; Sun, J.; Liu, B. Multi-scale 3d u-nets: An approach to automatic segmentation of brain tumor. Int. J. Imaging Syst. Technol. 2020, 30, 5–17. [Google Scholar] [CrossRef]
  17. Zhu, Y.; Pan, X.; Zhu, J.; Li, L. Multi-scale strategy based 3d dual-encoder brain tumor segmentation network with attention mechanism. In Proceedings of the 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Seoul, Republic of Korea, 16–19 December 2020; pp. 952–957. [Google Scholar]
  18. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
  19. Fang, L.; Wang, X. Brain tumor segmentation based on the dual-path network of multi-modal MRI images. Pattern Recognit. 2022, 124, 108434. [Google Scholar] [CrossRef]
  20. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803. [Google Scholar]
  21. Zhang, Y.; Liu, H.; Hu, Q. Transfuse: Fusing transformers and cnns for medical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Strasbourg, France, 27 September–1 October 2021; Springer: Cham, Switzerland, 2021; pp. 14–24. [Google Scholar]
  22. Valanarasu, J.M.J.; Oza, P.; Hacihaliloglu, I.; Patel, V.M. Medical transformer: Gated axial-attention for medical image segmentation. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021, Proceedings of the 24th International Conference, Strasbourg, France, 27 September–1 October 2021; Springer: Cham, Switzerland, 2021; pp. 36–46. [Google Scholar]
  23. Xie, Y.; Zhang, J.; Shen, C.; Xia, Y. Cotr: Efficiently bridging cnn and transformer for 3d medical image segmentation. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021, Proceedings of the 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part III 24; Springer: Cham, Switzerland, 2021; pp. 171–180. [Google Scholar]
  24. Hatamizadeh, A.; Tang, Y.; Nath, V.; Yang, D.; Myronenko, A.; Landman, B.; Roth, H.R.; Xu, D. Unetr: Transformers for 3d medical image segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 574–584. [Google Scholar]
  25. Hatamizadeh, A.; Nath, V.; Tang, Y.; Yang, D.; Roth, H.R.; Xu, D. Swin unetr: Swin transformers for semantic segmentation of brain tumors in mri images. In Proceedings of the International MICCAI Brainlesion Workshop, Singapore, 27 September 2021; Springer: Cham, Switzerland, 2022; pp. 272–284. [Google Scholar]
  26. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
  27. Peiris, H.; Hayat, M.; Chen, Z.; Egan, G.; Harandi, M. A robust volumetric transformer for accurate 3d tumor segmentation. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2022, Proceedings of the 25th International Conference, Singapore, 18–22 September 2022, Proceedings, Part V; Springer: Cham, Switzerland, 2022; pp. 162–172. [Google Scholar]
  28. Li, X.; Pang, S.; Zhang, R.; Zhu, J.; Fu, X.; Tian, Y.; Gao, J. ATTransUNet: An enhanced hybrid transformer architecture for ultrasound and histopathology image segmentation. Comput. Biol. Med. 2023, 152, 106365. [Google Scholar] [CrossRef] [PubMed]
  29. Zhou, H.-Y.; Guo, J.; Zhang, Y.; Han, X.; Yu, L.; Wang, L.; Yu, Y. nnFormer: Volumetric Medical Image Segmentation via a 3D Transformer. IEEE Trans. Image Process. 2023, 32, 4036–4045. [Google Scholar] [CrossRef] [PubMed]
  30. Lee, Y.; Kim, J.; Willette, J.; Hwang, S.J. Mpvit: Multi-path vision transformer for dense prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–25 June 2021; pp. 7287–7296. [Google Scholar]
  31. Dolz, J.; Gopinath, K.; Yuan, J.; Lombaert, H.; Desrosiers, C.; Ayed, I.B. HyperDense-Net: A hyper-densely connected CNN for multi-modal image segmentation. IEEE Trans. Med. Imaging 2018, 38, 1116–1126. [Google Scholar] [CrossRef] [PubMed]
  32. Syazwany, N.S.; Nam, J.-H.; Lee, S.-C. MM-BiFPN: Multi-modality fusion network with Bi-FPN for MRI brain tumor segmentation. IEEE Access. 2021, 9, 160708–160720. [Google Scholar] [CrossRef]
  33. Zhou, T.; Ruan, S.; Vera, P.; Canu, S. A Tri-Attention fusion guided multi-modal segmentation network. Pattern Recognit. 2022, 124, 108417. [Google Scholar] [CrossRef]
  34. Liu, Y.; Shi, Y.; Mu, F.; Cheng, J.; Li, C.; Chen, X. Multimodal mri volumetric data fusion with convolutional neural networks. IEEE Trans. Instrum. Meas. 2022, 71, 1–15. [Google Scholar] [CrossRef]
  35. Veshki, F.G.; Ouzir, N.; Vorobyov, S.A.; Ollila, E. Multimodal image fusion via coupled feature learning. Signal Process. 2022, 200, 108637. [Google Scholar] [CrossRef]
  36. Zhang, Y.; He, N.; Yang, J.; Li, Y.; Wei, D.; Huang, Y.; Zhang, Y.; He, Z.; Zheng, Y. mmFormer: Multimodal Medical Transformer for Incomplete Multimodal Learning of Brain Tumor Segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Singapore, 18–22 September 2022; Springer: Cham, Switzerland, 2022. [Google Scholar]
  37. Ma, J.; Tang, L.; Fan, F.; Huang, J.; Mei, X.; Ma, Y. SwinFusion: Cross-domain long-range learning for general image fusion via swin transformer. IEEE-CAA J. Automatica Sin. 2022, 9, 1200–1217. [Google Scholar] [CrossRef]
  38. Tang, W.; He, F.; Liu, Y.; Duan, Y. MATR: Multimodal medical image fusion via multiscale adaptive transformer. IEEE Trans. Image Process. 2022, 31, 5134–5149. [Google Scholar] [CrossRef] [PubMed]
  39. Sudre, C.H.; Li, W.; Vercauteren, T.; Ourselin, S.; Jorge Cardoso, M. Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, Proceedings of the Third International Workshop, DLMIA 2017, and 7th International Workshop, ML-CDS 2017, Held in Conjunction with MICCAI 2017, Québec, QC, Canada, 14 September 2017; Springer: Cham, Switzerland, 2017; pp. 240–248. [Google Scholar]
  40. Rezaei-Dastjerdehei, M.R.; Mijani, A.; Fatemizadeh, E. Addressing imbalance in multi-label classification using weighted cross entropy loss function. In Proceedings of the 2020 27th National and 5th International Iranian Conference on Biomedical Engineering (ICBME), Tehran, Iran, 26–27 November 2020; pp. 333–338. [Google Scholar]
  41. Valverde, J.M.; Tohka, J. Region-wise loss for biomedical image segmentation. Pattern Recognit. 2023, 136, 109208. [Google Scholar] [CrossRef]
  42. Dorent, R.; Joutard, S.; Modat, M.; Ourselin, S.; Vercauteren, T. Hetero-modal variational encoder-decoder for joint modality completion and segmentation. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2019, Proceedings of the 22nd International Conference, Shenzhen, China, 13–17 October 2019; Springer: Cham, Switzerland, 2019; pp. 74–82. [Google Scholar]
  43. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the International Conference on Learning Representations, Calgary, AB, Canada, 14–16 April 2014. [Google Scholar]
  44. Çiçek, Ö.; Abdulkadir, A.; Lienkamp, S.S.; Brox, T.; Ronneberger, O. 3D U-Net: Learning dense volumetric segmentation from sparse annotation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2016, Proceedings of the 19th International Conference, Athens, Greece, 17–21 October 2016; Springer: Cham, Switzerland, 2016; pp. 424–432. [Google Scholar]
  45. Milletari, F.; Navab, N.; Ahmadi, S.-A. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; pp. 565–571. [Google Scholar]
  46. Zhou, Z.; Rahman Siddiquee, M.M.; Tajbakhsh, N.; Liang, J. Unet++: A nested u-net architecture for medical image segmentation. In Proceedings of the Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, Granada, Spain, 20 September 2018; Springer: Cham, Switzerland, 2018; pp. 3–11. [Google Scholar]
  47. Chen, C.; Liu, X.; Ding, M.; Zheng, J.; Li, J. 3D dilated multi-fiber network for real-time brain tumor segmentation in MRI. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2019, Proceedings of the 22nd International Conference, Shenzhen, China, 13–17 October 2019; Springer: Cham, Switzerland, 2019; pp. 184–192. [Google Scholar]
  48. Akbar, A.S.; Fatichah, C.; Suciati, N. Single level UNet3D with multipath residual attention block for brain tumor segmentation. J. King Saud Univ.-Comput. Inf. Sci. 2022, 34, 3247–3258. [Google Scholar] [CrossRef]
  49. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. Transunet: Transformers make strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar]
Figure 1. An illustration of the proposed RFTNet for multimodal brain tumor image segmentation.
Figure 2. The illustration of the proposed dual-branch vision Transformer (DVT). (a) Dual-branch block. (b) Merge block.
Figure 3. The illustration of region–attention fusion (RAF).
Figure 4. Visualization results of the ablation study.
Table 1. Comparison of different methods on the BraTS2018 validation set.

| Methods | WT Dice (%) | TC Dice (%) | ET Dice (%) | WT HD (mm) | TC HD (mm) | ET HD (mm) |
|---|---|---|---|---|---|---|
| 3D U-Net [44] | 88.53 | 71.77 | 75.96 | 17.100 | 11.620 | 6.040 |
| UNet++ [46] | 88.76 | 82.05 | 77.94 | 6.312 | 8.643 | 4.829 |
| V-Net [45] | 89.60 | 81.00 | 76.60 | 6.540 | 7.820 | 7.210 |
| DMFNet [47] | 89.90 | **83.50** | 78.10 | 4.860 | 7.740 | 3.380 |
| TransUNet [49] | 89.95 | 82.04 | 78.38 | 7.105 | 7.673 | 4.281 |
| Akbar et al. [48] | 89.59 | 79.77 | 77.71 | 9.130 | 8.670 | 3.900 |
| mmFormer [36] | 89.56 | 83.33 | 78.75 | **4.434** | 8.038 | 3.274 |
| Ours | **90.30** | 82.15 | **80.24** | 5.968 | **6.411** | **3.158** |
Table 2. Comparison of different methods on the BraTS2020 validation set.

| Methods | WT Dice (%) | TC Dice (%) | ET Dice (%) | WT HD (mm) | TC HD (mm) | ET HD (mm) |
|---|---|---|---|---|---|---|
| 3D U-Net [44] | 84.11 | 79.06 | 68.76 | 13.366 | 13.607 | 50.983 |
| UNet++ [46] | 88.21 | 82.28 | 77.95 | 9.012 | 8.406 | 7.039 |
| V-Net [45] | 86.11 | 77.90 | 68.97 | 14.490 | 16.150 | 43.520 |
| DMFNet [47] | **90.08** | 81.50 | 76.41 | 7.170 | 12.170 | 35.170 |
| TransUNet [49] | 88.46 | 78.75 | 78.16 | **6.679** | 12.948 | 12.159 |
| Akbar et al. [48] | 88.57 | 80.19 | 72.91 | 10.260 | 13.580 | 31.970 |
| mmFormer [36] | 87.17 | 81.67 | 77.94 | 10.331 | 10.691 | 4.994 |
| Ours | 88.97 | **82.41** | **78.95** | 9.465 | **7.289** | **4.837** |
Table 3. Ablation study of each module in RFTNet.

| Models | DVT | RAF | RML | WT Dice (%) | TC Dice (%) | ET Dice (%) | WT HD (mm) | TC HD (mm) | ET HD (mm) |
|---|---|---|---|---|---|---|---|---|---|
| M0 | | | | 86.82 | 78.80 | 78.60 | 16.183 | 16.337 | 11.508 |
| M1 | | | | 89.30 | 81.74 | 79.69 | 14.177 | 11.980 | 12.240 |
| M2 | | | | 89.25 | 84.21 | 80.45 | 18.626 | 19.215 | 15.867 |
| M3 | | | | 89.34 | 80.09 | 77.93 | 9.837 | 11.038 | 9.927 |
| M4 | | | | 89.44 | 83.74 | 80.99 | 19.750 | 23.036 | 14.895 |
| M5 | | | | 88.69 | 79.94 | 77.75 | 25.167 | 13.993 | 8.489 |
| M6 | | | | 89.50 | 83.45 | 80.15 | 12.930 | 8.147 | 6.770 |
| M7 | √ | √ | √ | **90.37** | **84.79** | **81.75** | **8.046** | **7.601** | **5.694** |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
