1. Introduction
With the rapid development of deep learning and computer vision technology, medical image segmentation has been widely applied in assisted diagnosis and intelligent medical treatment, considerably enhancing the effectiveness and accuracy of diagnosis [1]. It can extract meaningful semantic information from raw medical image data, identify the pixels of diseased organs, and obtain the features of these diseased regions [2,3,4]. Early medical image segmentation methods, such as template matching, edge detection, and traditional machine learning, achieved certain results; however, because features are difficult to represent, medical image segmentation remains a challenging problem. Recently, deep-learning-based medical image segmentation has made good progress and attracted considerable attention. Among these approaches, methods based on Convolutional Neural Networks (CNNs) are widely used and have achieved good results.
Building on the AlexNet [5] architecture, the Fully Convolutional Network (FCN) [6] replaces the fully connected layers with convolutional layers and enlarges the feature map by up-sampling, thereby realizing end-to-end semantic segmentation at the pixel level. However, this method has limited capability in capturing fine-grained features, which creates a significant challenge for tasks that require precise segmentation (e.g., medical image segmentation). Ronneberger et al. [7] proposed the landmark U-Net architecture on the basis of FCN. The network consists of mutually corresponding encoding and decoding stages. In the encoding stage, image features are extracted from progressively down-sampled images; in the decoding stage, the images are up-sampled to gradually recover their original size [8]. The pooling operations and successive convolution kernels in the encoding stage cause the loss of some image feature information; in the decoding stage, however, the up-sampled feature map is merged with front-end information from the skip connections, thereby enriching the image's detailed features. This U-shaped structure allowed U-Net to achieve a major breakthrough in various image segmentation tasks and inspired many related algorithms. For example, Unet3+ [9] and Attention U-Net [10] enable the decoder to integrate more comprehensive feature information by aggregating features at different scales or by introducing attention mechanisms. The V-Net [11] network combines 3D convolution with the U-Net architecture in a volumetric fully convolutional 3D segmentation method, which has achieved good results on prostate MRI images.
Although CNN-based methods have good local feature representation ability, the inherent locality of the convolution operation makes it difficult for them to learn global information and model interactions between contextual semantics, so U-Net has certain limitations in long-range relationship modeling and complex prediction tasks [12,13]. Some works stack deeper convolutional layers to expand the receptive field, introduce attention mechanisms into the network structure, or adopt pyramid networks [14,15,16,17]; however, these choices may further increase model complexity and processing cost, and the ability to capture global information interactively still needs improvement.
Motivated by the success of the Transformer in NLP, the ViT [18] algorithm applies the standard Transformer structure directly to images with only minor changes to the overall image classification pipeline. ViT splits the entire image into smaller patches, and the sequence of linear embeddings of these patches is then fed to the network as the Transformer input. Compared with other convolution-based algorithms, it reaches state-of-the-art (SOTA) performance. As research has deepened, combinations of CNNs and Transformers have achieved good results. Chen et al. [19] proposed the TransUnet method in 2021, which uses a CNN as the feature extractor to generate a feature map, feeds it into a Transformer to extract global information, and uses a cascaded upsampler to ensure accurate prediction. This algorithm has achieved excellent results on multiple medical image datasets. In 2021, Cao et al. proposed the Transformer-based Swin-Unet [20] algorithm for medical image segmentation, which significantly decreases the computational burden and achieves high segmentation accuracy. However, Swin-Unet can only learn single-scale context features during training, its computation remains substantial, and it lacks an inductive bias module for processing local information [21]. Some subsequent methods, such as nnFormer, MISSFormer, DSTransUnet, TransDeepLab, and DAE-Former [22,23,24,25,26], adopt a pure Transformer architecture. Specifically, Transformer blocks are employed on both the encoder and decoder sides to capture more global characteristics and fuse multi-scale information through skip connections.
The Transformer architecture, distinguished from the local dependency bias of convolutional neural networks by its excellent global modeling capability and its ability to capture long-range dependencies, has attracted significant attention from researchers, although its high computational cost has deterred some. As a result, in recent years, research in medical image segmentation has shifted towards hybrid Transformer architectures with skip connections.
Zhao et al. argued that segmentation methods based on patch division might disrupt locally coherent features and introduce noise, and on this premise they proposed a progressive sampling module [27]. Unlike traditional segmentation models, this approach achieved promising results in small-object segmentation tasks, although it did not perform as well on large objects. Rahman et al. innovatively combined vision Transformers with graph convolution, proposing a graph-based cascaded attention architecture [28] to address the limitations of vision Transformers in handling local spatial correlations. Zhou et al. introduced a 3D Transformer architecture called nnFormer [29] based on multiple attention mechanisms, which balances long-range and spatial dependencies in the self-attention mechanism. In addition, Huang et al. redefined the skip connections between the encoder and decoder, further enhancing global dependencies through a bridge-like connection method [30].
Inspired by these works, we propose an RL-Unet algorithm based on Swin-Unet to address the aforementioned issues. It redesigns the up-sampling module and adds a local inductive bias module to the Swin-Transformer block, resulting in a densely connected double up-sampling module that can fully learn multi-scale information and increase the boundary segmentation accuracy of the algorithm. Extensive experiments on the Synapse multi-organ abdominal segmentation and BraTS2021 datasets show that the proposed method achieves high segmentation accuracy and strong generalization capability. Our contributions can be summarized as follows:
We add a local inductive bias module to the Swin-Transformer block to help the RLSwin-Transformer module learn local features and long-range dependencies, and replace the MLP module with a Res-MLP module to prevent feature loss during transmission, thus increasing segmentation accuracy.
We design a new double up-sampling module, which combines bilinear up-sampling with dilated convolutions of different dilation rates for feature extraction and image restoration.
We design a densely connected structure at the decoding end and introduce an attention mechanism to restore resolution and generate the segmentation map.
We propose a novel loss function that significantly increases the segmentation accuracy of small targets.
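To make the double up-sampling idea above concrete, the following NumPy sketch combines bilinear up-sampling with dilated-convolution branches of different rates. This is a simplified, single-channel illustration under assumed kernels, scale factor, and dilation rates; it is not the authors' actual DUSM implementation, which operates on multi-channel feature maps inside the network.

```python
import numpy as np

def bilinear_upsample(x, scale=2):
    """Bilinear up-sampling of a single-channel 2D feature map."""
    h, w = x.shape
    # Source coordinates for each output pixel (pixel-center convention).
    ys = np.clip((np.arange(h * scale) + 0.5) / scale - 0.5, 0, h - 1)
    xs = np.clip((np.arange(w * scale) + 0.5) / scale - 0.5, 0, w - 1)
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y1, x1 = np.minimum(y0 + 1, h - 1), np.minimum(x0 + 1, w - 1)
    wy, wx = ys - y0, xs - x0
    top = x[y0][:, x0] * (1 - wx) + x[y0][:, x1] * wx
    bot = x[y1][:, x0] * (1 - wx) + x[y1][:, x1] * wx
    return top * (1 - wy)[:, None] + bot * wy[:, None]

def dilated_conv(x, kernel, rate):
    """'Same'-padded 2D convolution with an odd kernel and a dilation rate."""
    kh, kw = kernel.shape
    pad = rate * (kh // 2)
    xp = np.pad(x, pad)
    out = np.zeros_like(x, dtype=float)
    for i in range(kh):          # accumulate each (dilated) kernel tap
        for j in range(kw):
            out += kernel[i, j] * xp[i * rate:i * rate + x.shape[0],
                                     j * rate:j * rate + x.shape[1]]
    return out

def double_upsample_block(x, kernels, rates):
    """Up-sample, then sum dilated-convolution branches with different rates."""
    up = bilinear_upsample(x)
    return sum(dilated_conv(up, k, r) for k, r in zip(kernels, rates))
```

Branches with larger dilation rates see a wider context at the same kernel size, which is the motivation for mixing several rates after up-sampling.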
The remainder of the paper is organized as follows: In Section 2, we introduce the overall details of the proposed network structure and design the loss function. Section 3 describes the datasets used, the evaluation scenarios and metrics, and discusses the experiments and results. Section 4 concludes the paper.
3. Experiments
3.1. Evaluation Metrics and Datasets
In this paper, the mean Dice-Similarity Coefficient (DSC) and mean Hausdorff Distance (HD) are utilized as evaluation metrics. Both describe the similarity between two sample sets, but DSC focuses more on the segmentation accuracy of the internal filled region, while HD is more sensitive to the segmentation boundary. They are defined as follows:

$$\mathrm{DSC}(A,B) = \frac{2\,|A \cap B|}{|A| + |B|}$$

$$\mathrm{HD}(A,B) = \max\{h(A,B),\; h(B,A)\} \tag{24}$$

$$h(A,B) = \max_{a \in A} \min_{b \in B} \lVert a - b \rVert \tag{25}$$

$$h(B,A) = \max_{b \in B} \min_{a \in A} \lVert b - a \rVert \tag{26}$$

where A represents the label map of the medical image and B represents the prediction map produced by the segmentation algorithm. Equation (24) is the bidirectional HD, and Equations (25) and (26) are the unidirectional HD from A to B and from B to A, respectively.
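For reference, both metrics can be computed directly from binary masks. The following NumPy sketch is our own illustrative implementation of the definitions above; in practice, libraries such as SciPy or MedPy provide optimized versions.

```python
import numpy as np

def dice_coefficient(a, b):
    """Dice-Similarity Coefficient between two binary masks."""
    a, b = a.astype(bool), b.astype(bool)
    inter = np.logical_and(a, b).sum()
    return 2.0 * inter / (a.sum() + b.sum())

def directed_hausdorff(a, b):
    """Unidirectional HD h(A, B): for each foreground pixel of A,
    the distance to the nearest foreground pixel of B; max over A."""
    pts_a = np.argwhere(a)  # coordinates of foreground pixels
    pts_b = np.argwhere(b)
    # Pairwise Euclidean distances between the two point sets.
    d = np.linalg.norm(pts_a[:, None, :] - pts_b[None, :, :], axis=-1)
    return d.min(axis=1).max()

def hausdorff_distance(a, b):
    """Bidirectional HD: max of the two unidirectional distances."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))
```

The pairwise-distance matrix makes the sketch O(|A|·|B|) in memory, which is fine for small masks but motivates the optimized library routines for full-resolution scans.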
This paper verifies the segmentation effect of RL-Unet on the Synapse abdominal multi-organ segmentation, BraTS2021, ACDC, and BUSI datasets.
Synapse dataset: This abdominal CT scan dataset consists of 30 cases and focuses on the segmentation of various abdominal organs. Every CT scan consists of 85–198 slices of 512 × 512 pixels. We randomly partition the dataset into 18 scans (2212 axial slices) for training and 12 scans for validation. We segment only eight abdominal organs: the aorta, gallbladder (GB), left kidney (KL), right kidney (KR), liver, pancreas (PC), spleen (SP), and stomach (SM).
BraTS2021 dataset: BraTS2021 includes multi-parametric MRI scans from 2000 patients. Each scan comprises four modalities (FLAIR, T1ce, T1, and T2), each of size 240 × 240 × 155, which share a common set of segmentation labels. The dataset was randomly split into training, validation, and test sets at a ratio of 8:1:1. Data augmentation methods included cropping, rotation, flipping, Gaussian noise, contrast transformation, and brightness enhancement.
ACDC dataset: The ACDC dataset comprises 100 patients and focuses on the segmentation of the right ventricular cavity, the left ventricular myocardium, and the left ventricular cavity. The segmentation labels for each case include the left ventricle (LV), right ventricle (RV), and myocardium (MYO). We use 70 cases (1930 axial slices) for training, 10 for validation, and 20 for testing.
BUSI dataset: The BUSI (Breast Ultrasound Image) dataset, categorized into normal, benign, and malignant classes, serves as a comprehensive resource for both classification and segmentation tasks in the analysis of breast ultrasound images. For training purposes, only the normal and malignant categories from the BUSI dataset are selected.
3.2. Experiment Settings
Synapse dataset and BraTS2021 dataset. The pre-training model is initialized on the ImageNet dataset with an initial learning rate of 0.05, and a clustering learning rate strategy is employed. The maximum number of training epochs is 150, the batch size is 32, and RL-Unet uses the SGD optimizer with a momentum of 0.9 and a weight decay of 1 ×. Several data augmentation methods are used to increase data diversity, including rotating the image by 30, 45, 90, 120, and 150 degrees; translating the image by (15, 15), (−15, −15), (−15, 15), and (15, −15); and enhancing image tone with factors of 0.9 and 1.3.
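A minimal NumPy sketch of the translation and tone-enhancement steps above (arbitrary-angle rotation would normally use a library routine such as scipy.ndimage.rotate and is omitted here); the image shape and the [0, 1] intensity range are our own assumptions for illustration.

```python
import numpy as np

def translate(img, dy, dx):
    """Shift a 2D image by (dy, dx) pixels, filling vacated areas with zeros."""
    h, w = img.shape
    out = np.zeros_like(img)
    src_y = slice(max(-dy, 0), min(h - dy, h))
    dst_y = slice(max(dy, 0), min(h + dy, h))
    src_x = slice(max(-dx, 0), min(w - dx, w))
    dst_x = slice(max(dx, 0), min(w + dx, w))
    out[dst_y, dst_x] = img[src_y, src_x]
    return out

def enhance_tone(img, factor):
    """Scale intensities by `factor`, clipping to the [0, 1] range."""
    return np.clip(img * factor, 0.0, 1.0)
```

Flips can be added with np.flipud / np.fliplr; each transform is applied to the image and its label mask together so the pair stays aligned.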
ACDC dataset and BUSI dataset. We train each model for a maximum of 150 epochs with a batch size of 32. For RL-Unet, we set the input resolution to 224 × 224. For data augmentation, we use random flipping and rotation. We optimize a combined loss of weighted Cross-entropy (weight 0.3) and Dice (weight 0.7).
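The combined loss can be sketched as follows. This NumPy version handles the binary case and uses our own assumed smoothing constants; it only illustrates the 0.3/0.7 weighting, not the exact implementation used in the paper.

```python
import numpy as np

def dice_loss(probs, target, smooth=1e-6):
    """1 - DSC computed on soft foreground probabilities."""
    inter = (probs * target).sum()
    return 1.0 - (2.0 * inter + smooth) / (probs.sum() + target.sum() + smooth)

def bce_loss(probs, target, eps=1e-12):
    """Binary cross-entropy on predicted foreground probabilities."""
    return -np.mean(target * np.log(probs + eps)
                    + (1.0 - target) * np.log(1.0 - probs + eps))

def combined_loss(probs, target, w_ce=0.3, w_dice=0.7):
    """Weighted sum: 0.3 * cross-entropy + 0.7 * Dice loss."""
    return w_ce * bce_loss(probs, target) + w_dice * dice_loss(probs, target)
```

Weighting the region-overlap (Dice) term more heavily than the pixel-wise term is a common choice when foreground regions are small relative to the image.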
3.3. Experiment Results
The DSC and HD values obtained by RL-Unet for the segmentation of eight abdominal organs on the Synapse dataset are compared with the classical segmentation networks ViT, U-Net, Swin-Unet, Attention Unet (AttU-Net), GCASCADE, nnFormer, MISSFormer, and the Recurrent Residual Convolutional Neural Network based on U-Net (R2U-Net). The experimental results are shown in Table 1.
Table 1.
Segmentation results of different methods on the SYNAPSE dataset.
| Methods | Aorta | Gallbladder | Kidney (L) | Kidney (R) | Liver | Pancreas | Spleen | Stomach | DSC | HD |
|---|---|---|---|---|---|---|---|---|---|---|
| ViT [18] | 44.38 | 39.59 | 67.46 | 62.94 | 89.21 | 43.14 | 75.45 | 69.87 | 61.50 | 39.61 |
| U-Net [7] | 89.18 | 65.42 | 76.86 | 70.64 | 93.35 | 55.37 | 89.80 | 76.01 | 77.03 | 39.70 |
| Swin-Unet [20] | 86.34 | 63.45 | 81.13 | 75.69 | 93.63 | 56.83 | 87.94 | 73.22 | 77.28 | 26.93 |
| AttU-Net [10] | 88.59 | 64.42 | 81.73 | 76.77 | 93.99 | 63.68 | 89.56 | 71.18 | 78.74 | 29.54 |
| R2U-Net [42] | 88.26 | 68.97 | 76.94 | 71.36 | 91.86 | 57.36 | 87.36 | 74.88 | 77.12 | 32.12 |
| GCASCADE [28] | 81.55 | 79.75 | 83.06 | 64.46 | 62.89 | 64.43 | 82.87 | 89.39 | 76.05 | 12.34 |
| nnFormer [29] | 92.04 | 70.17 | 86.57 | 86.25 | 96.84 | 83.35 | 90.51 | 86.83 | 86.57 | 10.63 |
| MISSFormer [30] | 86.99 | 68.65 | 85.21 | 82.00 | 94.41 | 65.67 | 91.92 | 80.81 | 81.96 | 18.20 |
| RL-Unet (ours) | 86.95 | 67.48 | 82.68 | 79.47 | 93.84 | 55.97 | 90.21 | 77.41 | 80.13 | 22.07 |
Compared with the basic ViT and U-Net, the DSC and HD of RL-Unet are greatly improved, demonstrating the effectiveness of combining CNN and Transformer. Compared with the classical medical segmentation algorithm U-Net, the DSC of RL-Unet increases by 3.1% and the HD improves by 17.63 mm, indicating the feasibility of RL-Unet's segmentation and its sensitivity to small organs. Compared with Swin-Unet, AttU-Net, and R2U-Net, the DSC and HD of RL-Unet are still greatly improved, which indicates that RL-Unet attends to local information and extracts and integrates multi-scale contextual information, effectively improving segmentation accuracy. Although our proposed algorithm exhibits a slightly lower DSC than the latest MISSFormer and nnFormer algorithms, we were unable to precisely reproduce the original authors' results due to machine limitations and variations in data processing. To further validate the effectiveness of our algorithm, we conducted experiments on the BraTS2021, ACDC, and BUSI datasets.
As shown in Figure 6, the boundaries segmented by RL-Unet are clearer and smoother than those of the other algorithms. The second row shows that several algorithms miss part of the pancreas, and the segmentation boundary of the stomach in this slice is poor, overlapping with the pancreas prediction. The third row shows that several algorithms achieve a good segmentation effect and high accuracy on large targets such as the liver and stomach, while suffering from missed segmentation and over-segmentation of the pancreas. Our proposed RL-Unet, however, segments the pancreas and other small target organs significantly better than the other algorithms, recovering the pancreas and stomach almost completely. The fourth row reveals that Attention Unet and Swin-Unet segment the stomach poorly, with missed segmentation, while RL-Unet achieves a better segmentation effect on the stomach.
Despite the RL-Unet architecture’s capability to fuse multi-level semantic features, thereby preserving crucial edge information, its performance in pancreas segmentation remains suboptimal on certain images. This is primarily due to the small proportion of pancreas representation in the dataset, which leads to interference from surrounding organs and tissues during feature extraction. Consequently, a considerable amount of edge information is lost, compromising the model’s overall segmentation accuracy for the pancreas.
Figure 7a describes the average DSC scores obtained by different models with different training iterations on the Synapse dataset. As the number of training iterations increases, the average DSC score of our model is higher than that of other models. The DSC index, which calculates the similarity between the two samples, shows that our training results and labels are more similar, verifying the validity of our proposed model.
Figure 7b depicts the average HD obtained after different numbers of training iterations for different models on the Synapse dataset. As can be seen from Figure 7b, the HD of our proposed model is significantly smaller than that of the other models, exceeding the Att-Unet model only at 20 K iterations, which may be an anomaly caused by isolated points; this demonstrates that the segmentation results are good and close to the ground truth.
As can be seen from Table 2, the proposed RL-Unet algorithm still achieves a good segmentation effect on the BraTS2021 dataset, with DSC indices above 80%, matching or even surpassing the results of U-Net, Attention Unet, Swin-Unet, and other classical medical image segmentation algorithms. This shows that the algorithm has good generalization ability and robustness.
To verify the generalization of the model, we performed another set of experiments on MRI images of the ACDC dataset.
Table 3 shows the average DSC of our proposed method compared with previous advanced methods. Compared with previous models such as AttnUnet, SwinUnet, TransUnet, and MT-Unet, the proposed method achieves a higher average DSC score; compared with the PVTCASCADE model, its average DSC is slightly lower but still competitive.
As shown in Table 4, RL-Unet outperforms most SOTA architectures on the BUSI dataset, achieving a higher DSC score. Notably, the performance of RL-Unet is slightly below that of U-Net3+. We attribute U-Net3+'s superior performance to its full-scale skip connections, which are particularly effective for accurately segmenting complex structures in medical images. To compensate for the lack of full-scale skip connections, we introduced JCA and JSA into our architecture, raising its performance to a comparable level.
As illustrated in Figure 8, the detailed annotations and consistent segmentation shapes in the ACDC dataset enable most architectures, including RL-Unet, to achieve high segmentation accuracy and visually appealing results. In certain cases, U-Net exhibits partial segmentation omissions. As observed in the third row, the segmentation results produced by RL-Unet and Attention Unet are more complete and more closely aligned with the ground truth, whereas the segmentation contours of Swin-Unet and U-Net show partial omissions. In the classification of benign cases, most architectures achieve satisfactory segmentation, with smooth transitions along the segmentation boundaries. In the classification of malignant cases, however, most architectures can only delineate approximate contours and often fail to capture detailed features accurately. RL-Unet restores details more effectively, mitigating both over-segmentation and under-segmentation.
3.4. Ablation Experiment
The four parts of the proposed model are validated by ablation experiments. (1) The MLP in the Transformer module is replaced with Res-MLP and a local inductive bias module is introduced. (2) A new feature aggregation module DUSM is proposed. (3) A new attention module is introduced. (4) A new loss function is designed.
First, we replace the MLP with Res-MLP in the Transformer module and add the local inductive bias module; the baseline model is Swin-Unet. As can be seen from Table 5, the DSC index increases by 1.08%. Experiments show that the Res-MLP module and the local inductive bias module make the semantic feature propagation of the Transformer more comprehensive and increase the encoder's understanding of global context information, thus improving the performance of the whole structure. Next, a new feature aggregation module is designed at the decoding end, improving the DSC by 1.31%. Meanwhile, building on previous work on linear attention mechanisms, a new attention mechanism is introduced to enhance the channel and spatial relationships between semantic features; experiments show that the proposed attention mechanism is effective. Finally, a novel loss function is proposed. When this loss function is applied to our method, as shown in Figure 7c, the resulting loss curve is smoother and converges more strongly than those of Swin-Unet and the other algorithms, verifying the effectiveness of the proposed design. Over multiple experiments, the DSC score increases by 2.85% and the HD index also improves significantly, which demonstrates the effectiveness of the proposed loss function.
3.5. Hyperparametric Learning Rate
The experiments show that the results change with the learning rate. To obtain the best learning rate, we compared values of 0.01, 0.05, and 0.1, as shown in Table 6. With all other conditions equal, the best segmentation results are obtained when the learning rate is 0.05. Consequently, a learning rate of 0.05 was used as the starting point for every experiment in this research.