Article

MS-UNet: Multi-Scale Nested UNet for Medical Image Segmentation with Few Training Data Based on an ELoss and Adaptive Denoising Method †

School of Computer Science and Technology, Dongguan University of Technology, Dongguan 523808, China
* Author to whom correspondence should be addressed.
This article is a revised and expanded version of a paper entitled “MS-UNet: Swin Transformer U-Net with Multi-scale Nested Decoder for Medical Image Segmentation with Small Training Data”, which was presented at the 6th Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2023, Xiamen, China, 13–15 October 2023.
Mathematics 2024, 12(19), 2996; https://doi.org/10.3390/math12192996
Submission received: 13 August 2024 / Revised: 23 September 2024 / Accepted: 24 September 2024 / Published: 26 September 2024
(This article belongs to the Special Issue Machine-Learning-Based Process and Analysis of Medical Images)

Abstract

Traditional U-shaped segmentation models can achieve excellent performance with an elegant structure. However, the single-layer decoder structure of U-Net or SwinUnet is too “thin” to exploit enough information, resulting in large semantic differences between the encoder and decoder parts. Things get worse in the field of medical image processing, where annotated data are more difficult to obtain than in other tasks. Based on this observation, we propose a U-like model named MS-UNet with a plug-and-play adaptive denoising module and ELoss for the medical image segmentation task in this study. Instead of the single-layer U-Net decoder structure used in Swin-UNet and TransUNet, we specifically designed a multi-scale nested decoder based on the Swin Transformer for U-Net. The proposed multi-scale nested decoder structure allows the feature mapping between the decoder and encoder to be semantically closer, thus enabling the network to learn more detailed features. In addition, ELoss improves the attention of the model to the segmentation edges, and the plug-and-play adaptive denoising module prevents the model from learning wrong features without losing detailed information. The experimental results show that MS-UNet effectively improves network performance through more efficient feature learning and exhibits more advanced performance, especially in the extreme case with a small amount of training data. Furthermore, the proposed ELoss and denoising module not only significantly enhance the segmentation performance of MS-UNet but can also be applied individually to other models.

1. Introduction

Medical image segmentation plays a crucial role in computer-aided diagnosis and intelligent medicine [1]. CNN-based networks, such as U-Net [2], and their variants have been dominant in this field, showcasing notable performance improvements [3,4,5]. The symmetric encoder-decoder structure of U-Net, coupled with skip connections, allows for the extraction of rich image information while maintaining a relatively lightweight design.
However, CNN-based networks face limitations. The inductive bias of CNNs restricts receptive fields to fixed-size windows, hindering long-range pixel dependencies. Transformers, powered by multi-head self-attention [6], establish global connections between sequence tokens [7], and hybrid designs that combine transformer and CNN ideas [8] adapt them to pixel-level tasks. To further improve the segmentation performance of CNN-based models, DLA [9] and UNet++ [3] successfully demonstrated that deep layer aggregation can effectively improve recognition and segmentation performance given sufficient data. For transformer-based models, from TransUnet [10] to Swin-UNet [11], researchers have focused on combining CNNs and transformers, with the overall structures still following the traditional U-Net design. However, the single-layer structure of U-Net is too simple (connecting only through skip connections) to accommodate the complexity of the transformer structure, resulting in large semantic differences between the encoder and decoder parts. Things get worse when the amount of training data is not sufficiently large, which is common in medical image processing, where annotated data are more difficult to obtain than in other tasks.
Thus, we propose MS-UNet, a simple but more effective 2D medical image segmentation architecture based on the Swin Transformer. MS-UNet replaces the single-layer decoder structure of Swin-UNet with a hierarchical multi-scale decoder structure inspired by [3,12,13]. The multi-scale nested decoder structure enables MS-UNet to learn the semantic information of the feature map from a more multi-dimensional perspective. In other words, MS-UNet obtains a tighter semantic hierarchy of the feature map between the encoder and decoder parts, thus effectively improving its stability and generalization. Furthermore, this reduces the reliance on large-scale training data, which is important for medical image datasets, whose labeled data are more difficult to obtain than those of other domains.
Aside from the model structure, edge segmentation performance remains a common challenge in current medical image segmentation tasks [14,15,16]. In order to address this, researchers have explored edge-aware modules and corresponding loss functions [17,18,19,20,21,22,23]. However, these methods often increase model complexity and training costs. Our study introduces ELoss, a loss function tailored for medical image segmentation. ELoss boosts network sensitivity to target boundaries during training, leading to finer segmentation edges without adding computational complexity or extending training time.
While MS-UNet and ELoss improve segmentation, noise in medical image data remains a challenge [24]. Noise can obscure vital details, hindering model recognition and analysis during training [25,26,27]. We propose a plug-and-play adaptive denoising (PAD) module, which was inspired by the “pretraining then fine-tuning” paradigm [28,29] and trainable denoising techniques [22,30,31,32]. The PAD module includes a trainable denoising component to fine-tune the model after pretraining, reducing noise interference.
The PAD module prevents the network from learning incorrect features over multiple training epochs, preserving overall performance. It incorporates the denoising module to capture vital medical image features while minimizing noise impact. This module helps the model learn and retain crucial yet often overlooked image details from early training stages. PAD effectively addresses challenges related to limited medical image datasets and noise presence. By leveraging “pretraining then fine-tuning” and trainable denoising techniques, our approach enhances medical image segmentation accuracy while reducing noise interference. The experimental results in subsequent sections demonstrate our method’s effectiveness.
In this paper, our contributions are four-fold:
  • A semantic segmentation framework with a multi-scale nested decoder, MS-UNet, achieves a substantial improvement in segmentation results over other SOTA models in the U-Net family in the extreme case where only a very small amount of training data is available;
  • ELoss is proposed to increase the sensitivity of the network to boundary information without increased computational complexity and extra training time. It allows the network to output smoother edges, making the predictions closer to the ground truth;
  • We propose a plug-and-play adaptive denoising (PAD) module, which prevents the model from learning wrong features from image noise. A few fine-tuning epochs yield substantial segmentation performance gains in return;
  • We have successfully used MS-UNet with ELoss and the PAD module for CT, MRI, and X-ray medical image processing tasks and have surpassed other state-of-the-art methods.
The remainder of the paper is organized as follows: Section 2 briefly reviews the related work. In Section 3, we describe the details of our proposed MS-UNet, ELoss, and plug-and-play denoising module. Section 4 reports the results of MS-UNet on medical image segmentation tasks, as well as a series of ablation experiments. We summarize our work in Section 5.

2. Related Works

2.1. Medical Image Segmentation Models

UNet, which is based on convolutional neural networks (CNNs), has excelled in medical image segmentation, with various extensions proposed for enhancement. UNet++ [3] introduced nested and dense connections to mitigate semantic disparities within the network. Res-UNet [4] combined U-Net with ResNet and attention mechanisms, notably improving retinal vessel segmentation.
In recent years, the transformer model has gained significant traction in computer vision [33]. Vision Transformer (ViT) [33], as the pioneering pure transformer-based model, has revolutionized various computer vision tasks by processing 2D image patches with positional embeddings. Its success has inspired researchers to explore its application in diverse domains, including medical image segmentation. Chen et al. introduced the first transformer-based medical image segmentation framework [10], using the CNN feature map as input to the transformer encoder and achieving remarkable results in medical image segmentation tasks. Furthermore, Swin Transformer [8], a hierarchical vision transformer, combines the strengths of transformers and CNNs: its shifted-window scheme not only reduces computational complexity but also achieves state-of-the-art performance in computer vision tasks. Building upon this, researchers introduced a pure transformer network with a U-shaped encoder-decoder architecture for 2D medical image segmentation [11]. Additionally, the Dual Swin Transformer U-Net (DS-TransUNet) was proposed, incorporating the hierarchical Swin Transformer’s advantages into a standard U-shaped encoder-decoder architecture and effectively improving medical image segmentation performance [13].
We propose MS-UNet, a simple but highly effective 2D medical image segmentation multi-scale architecture based on Swin Transformer. This architecture allows MS-UNet to learn semantic information from the feature map in a more comprehensive and multi-dimensional manner, leading to improved segmentation performance.

2.2. Edge Loss

In various studies focused on improving segmentation performance, particularly in the context of medical image analysis, researchers have recognized the pivotal role of loss functions in image segmentation. The precise delineation of anatomical structure boundaries is often crucial, as it is in expert manual segmentation.
In order to address this, researchers have introduced innovative approaches. For instance, Hatamizadeh et al. incorporated a dedicated edge branch and an edge-aware loss term to consider organ boundary information [21]. Similarly, Kuang et al. improved edge information perception by adding an edge branching module with edge loss [22]. Han et al. proposed a class-aware edge loss module that reduces segmentation errors in edge regions without affecting inference speed [23]. In medical image segmentation, Gu et al. designed an edge information guiding module (EIGM) and introduced an overall-center-edge (ECE) loss function to optimize boundary details by emphasizing boundary information [14]. However, it is essential to note that while these approaches enhance edge information learning, they may introduce additional parameters and increase the model’s computational complexity, potentially impacting learning efficiency.
In this paper, we propose ELoss, which operates on the entire model, enabling the production of finer segmented edges without the need for additional edge branching modules. Furthermore, we propose an algorithm for generating edge-labeled data using the Sobel operator, which efficiently produces edge labels for datasets lacking pre-existing edge annotations.

2.3. Denoising Methods

In medical image processing, noise can severely degrade image quality and impact model predictions. Consequently, denoising is a crucial preprocessing step in this domain [27].
Researchers have explored various denoising techniques tailored to medical images. Liang et al. developed an edge enhancement module using trainable Sobel convolution, integrating extracted edge information for end-to-end image denoising in a densely connected model [30]. Luthra et al. combined the trainable Sobel-Feldman operator with a transformer, introducing Eformer, an edge-enhanced transformer-based approach for medical image denoising [31]. While these methods have shown promise, they often require substantial computational resources. Medical images contain subtle details crucial for representing medical features, making it challenging to preserve these details during denoising. Recently, the “pretraining then fine-tuning” paradigm, borrowed from large models applied in computer vision tasks [28] and natural language processing (NLP) tasks [32], has gained traction for addressing this challenge.
Motivated by this, we propose a fine-tuning trainable denoising module. This module reduces noise by introducing a trainable denoising component while keeping the model backbone frozen. By adding a small number of parameters to the trained model, our approach effectively reduces medical image noise and prevents the model from incorrectly utilizing noise as important feature information without sacrificing relevant medical features.

3. Method

Figure 1 shows MS-UNet, which employs the Swin Transformer as its backbone network, departing from traditional U-like architectures. It comprises an encoder step, a multi-scale nested decoder step, and skip connections [34]. We enhance its segmentation capability with the plug-and-play adaptive denoising (PAD) module, consisting of a trainable Gaussian block and an anisotropic diffusion block, to preprocess images and mitigate the impact of noise. Our loss function incorporates three elements: cross-entropy (CE) loss, Dice similarity coefficient (DSC) loss, and our ELoss (Section 3.2). Below, we describe each component of MS-UNet in detail.

3.1. MS-UNet

MS-UNet, the core component of our approach, combines the Swin Transformer with a multi-scale architecture for medical image segmentation. It comprises an encoder step, a multi-scale nested decoder step, and skip connections.
In the encoder, we partition the medical image into patches, forming raw features by concatenating pixel values. Swin Transformer blocks and the patch merging layers transform these features, capturing multi-scale information and global context. The decoder branch also employs Swin Transformer with upsampling modules and patch-expanding layers, ensuring consistency with the encoder. The multi-scale nested decoder replaces the traditional decoder, improving feature extraction from the complex encoder output.
Multi-scale nested blocks within the decoder branch serve two vital functions. They independently upsample decoded features from each encoder, integrating information at different scales, and employ skip connections to concatenate these features with corresponding decoder input features in subsequent layers. This fusion process aligns semantic levels, facilitating information exchange and feature consistency.
Our MS-UNet with the multi-scale nested decoder structure efficiently fuses features across scales and between adjacent decoders, ensuring robust information flow. This structure enhances stability, generalization, and performance, making it suitable for medical image segmentation, even with limited training data.
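To make the fusion pattern concrete, the following is a minimal, illustrative PyTorch sketch of one nested fusion step: a deeper decoder feature is upsampled to the scale of an encoder skip feature, and the two are concatenated and projected. The bilinear upsampling and 1 × 1 convolution here are stand-ins of our own; the actual MS-UNet performs these steps with Swin Transformer blocks and patch-expanding layers.

```python
import torch
import torch.nn as nn

class NestedFusionBlock(nn.Module):
    """Illustrative nested fusion step: upsample a deeper decoder feature and
    fuse it with the same-scale encoder skip feature by concatenation."""
    def __init__(self, deep_ch: int, skip_ch: int, out_ch: int):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.fuse = nn.Conv2d(deep_ch + skip_ch, out_ch, kernel_size=1)

    def forward(self, deep: torch.Tensor, skip: torch.Tensor) -> torch.Tensor:
        deep = self.up(deep)                          # bring the deeper map to the skip scale
        return self.fuse(torch.cat([deep, skip], 1))  # align semantic levels by fusion

# Example: fuse a 1/8-scale feature map into the 1/4-scale level.
block = NestedFusionBlock(deep_ch=192, skip_ch=96, out_ch=96)
out = block(torch.randn(1, 192, 28, 28), torch.randn(1, 96, 56, 56))  # -> (1, 96, 56, 56)
```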

3.2. Loss Functions

The loss functions we use in the training process are the cross-entropy loss function, the Dice similarity coefficient loss function, and our proposed ELoss function. The total loss is

$$L_{total} = \omega_1 L_{CE} + \omega_2 L_{DSC} + \omega_3 L_{Edge}, \tag{1}$$

3.2.1. Cross-Entropy (CE) Loss

Cross-entropy loss is the most popular loss function for image semantic segmentation tasks, which examines each pixel individually and compares the predictions for each pixel class with the label vector. The equation for this concept is
$$\ell_n = -\log \frac{\exp(x_{n, y_n})}{\sum_{c=1}^{C} \exp(x_{n, c})},$$

$$L_{CE}(x, y) = \sum_{n=1}^{N} \frac{\ell_n}{N},$$
where C is the number of classes, x is the input, y is the ground truth, and N spans the minibatch dimension.
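For reference, the per-pixel cross-entropy above corresponds to the standard PyTorch criterion; the tensor shapes below are illustrative only.

```python
import torch
import torch.nn as nn

logits = torch.randn(2, 9, 224, 224)           # (N, C, H, W) raw class scores
labels = torch.randint(0, 9, (2, 224, 224))    # (N, H, W) integer class indices
loss = nn.CrossEntropyLoss()(logits, labels)   # per-pixel CE averaged over the minibatch
```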

3.2.2. Dice Similarity Coefficient (DSC) Loss

Dice similarity coefficient loss is a metric-based loss for measuring the similarity of sets. It is usually used to calculate the similarity between two samples, and its value lies in [0, 1]. The equation for this concept is

$$L_{DSC} = 1 - \frac{2\,|X \cap Y|}{|X| + |Y|},$$

where X and Y are two sets, $| \cdot |$ denotes the cardinality of a set (i.e., the number of elements it contains), and ∩ denotes the intersection of two sets, i.e., the elements common to both.
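A minimal PyTorch sketch of the corresponding soft Dice loss follows; the small eps smoothing term is our addition for numerical stability and is not part of the equation above.

```python
import torch

def dice_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Soft Dice loss; `pred` holds probabilities in [0, 1] and `target` holds
    binary masks of the same shape (N, H, W)."""
    pred, target = pred.flatten(1), target.flatten(1)
    intersection = (pred * target).sum(dim=1)                        # |X ∩ Y| per sample
    dice = (2.0 * intersection + eps) / (pred.sum(1) + target.sum(1) + eps)
    return (1.0 - dice).mean()
```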

3.2.3. ELoss

In medical image analysis, anatomical structure boundaries are crucial for manual segmentation. In order to address the jagged edges in model outputs, we introduce ELoss, enhancing the model’s sensitivity to image boundaries and producing smoother and more accurate segmentation results. The equation for this concept is
$$edge = \frac{2 \times \mathrm{sum}(target * score)}{\mathrm{sum}(target * target) + \mathrm{sum}(score * score)},$$

$$L_{Edge} = 1 - edge,$$

where target is the tensor of edge ground truths, score is the tensor computed from the model output and the edge ground truths, ∗ denotes element-wise multiplication between two tensors, and sum(·) denotes the sum of all elements within a tensor. The detailed calculation process for the score is illustrated in Figure 2.
Our ELoss, unlike other models, is applied directly to the final model output rather than to a specific edge branch prediction module. This unique approach enhances the model’s sensitivity to edge information without introducing additional computational complexity or training time. Importantly, this method could be easily applied to any other model, boosting its ability to learn from edge information.
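Because score is simply the element-wise product of the model output and the edge ground truth (Figure 2), ELoss can be sketched in a few lines of PyTorch; as before, the eps term is our addition for numerical stability.

```python
import torch

def eloss(output: torch.Tensor, edge_gt: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """ELoss sketch: a Dice-style score between the edge ground truth and the
    model output restricted to edge pixels (cf. Figure 2)."""
    score = output * edge_gt  # element-wise product: prediction on edge pixels only
    edge = 2.0 * torch.sum(edge_gt * score) / (
        torch.sum(edge_gt * edge_gt) + torch.sum(score * score) + eps)
    return 1.0 - edge
```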

3.3. Edge Label Generation Method

In medical image segmentation, acquiring labeled data, especially for specific parts such as edges, is challenging and resource-intensive. In order to overcome this hurdle, we propose an efficient edge label generation method using the Sobel operator. We start with an image tensor extracted from the ground truth dataset and assign a value of 1 to labeled elements and 0 to others based on the Sobel operator’s characteristics. The tensor then undergoes Sobel convolution to calculate the edge output. We consider elements greater than 0.1 as edge labels, creating a preliminary edge label tensor. Finally, we combine this with the original image tensor using logical conjunction to obtain the final edge label data.
This method allows us to efficiently and accurately generate edge labels, significantly increasing the diversity of labeled data, a crucial advantage in medical image segmentation tasks with limited annotated data.
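The procedure can be sketched in PyTorch as follows; the use of two Sobel kernels (horizontal and vertical) and the reduction over their absolute responses are our own assumptions, since the text does not fix these layout details.

```python
import torch
import torch.nn.functional as F

def generate_edge_labels(mask: torch.Tensor, threshold: float = 0.1) -> torch.Tensor:
    """Edge-label sketch following Section 3.3: binarize the mask, convolve
    with Sobel kernels, threshold at 0.1, then AND with the original mask.
    `mask` is an (N, H, W) ground-truth tensor."""
    sobel_x = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
    kernel = torch.stack([sobel_x, sobel_x.t()]).unsqueeze(1)   # (2, 1, 3, 3)
    mask = (mask > 0).float().unsqueeze(1)                      # labeled -> 1, others -> 0
    grad = F.conv2d(mask, kernel, padding=1)                    # Sobel responses
    edges = grad.abs().amax(dim=1, keepdim=True) > threshold    # preliminary edge tensor
    return (edges & mask.bool()).squeeze(1).float()             # logical conjunction

edge_labels = generate_edge_labels(torch.randint(0, 2, (1, 224, 224)))
```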

3.4. Plug-and-Play Adaptive Denoising Module

In light of recent advancements in the “pretraining then fine-tuning” paradigm in the computer vision community [28], we propose a plug-and-play adaptive denoising module. This approach ensures that models learn a sufficient number of features before mitigating the impact of noise, as depicted in Figure 3. By incorporating this module into the overall framework, we aim to strike a balance between noise reduction and feature preservation, ultimately improving the performance of medical image-denoising algorithms.
We first train MS-UNet on the original images to ensure that the model learns the detailed features of the images. After that, we introduce the PAD module and fine-tune only its parameters while keeping the backbone frozen. In the PAD module, we first apply trainable Gaussian convolution and anisotropic diffusion convolution to the original input to obtain two denoised images. For the anisotropic diffusion convolution, we use an eight-direction diffusion operation with 30 rounds. The four-direction diffusion equation is
$$I_{t+1} = I_t + \lambda \left( c_{N_{x,y}} \nabla_N(I_t) + c_{S_{x,y}} \nabla_S(I_t) + c_{E_{x,y}} \nabla_E(I_t) + c_{W_{x,y}} \nabla_W(I_t) \right),$$

where $I$ represents the image and $t$ the diffusion round. $\nabla_N$, $\nabla_S$, $\nabla_E$, and $\nabla_W$ represent the differences in the four directions for the pixel $(x, y)$:

$$\nabla_N(I_{x,y}) = I_{x,y-1} - I_{x,y}, \quad \nabla_S(I_{x,y}) = I_{x,y+1} - I_{x,y}, \quad \nabla_E(I_{x,y}) = I_{x-1,y} - I_{x,y}, \quad \nabla_W(I_{x,y}) = I_{x+1,y} - I_{x,y},$$

and $c_N$, $c_S$, $c_E$, and $c_W$ represent the thermal conductivity in the four directions:

$$c_{N_{x,y}} = \exp\!\left(-\|\omega_N \nabla_N(I)\|^2 / k^2\right), \quad c_{S_{x,y}} = \exp\!\left(-\|\omega_S \nabla_S(I)\|^2 / k^2\right), \quad c_{E_{x,y}} = \exp\!\left(-\|\omega_E \nabla_E(I)\|^2 / k^2\right), \quad c_{W_{x,y}} = \exp\!\left(-\|\omega_W \nabla_W(I)\|^2 / k^2\right),$$

where $\omega_N$, $\omega_S$, $\omega_E$, and $\omega_W$ are trainable weight parameters for the four directions.
Subsequently, the two noise-reduced images obtained by the above operations are concatenated with the original image and processed by a 1 × 1 convolutional layer with a stride of 1. Finally, batch normalization, ReLU activation, and a Sigmoid function are applied to produce a noise-reduction weight map. With this module, we can reduce the impact of image noise on performance by tuning only a few parameters during fine-tuning.
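As a sketch of the diffusion component, the following implements the four-direction trainable anisotropic diffusion defined by the equations above; the paper's PAD module uses an eight-direction variant over 30 rounds, and the default values of k and λ here are assumptions.

```python
import torch
import torch.nn as nn

class TrainableAnisotropicDiffusion(nn.Module):
    """Four-direction Perona-Malik-style diffusion with trainable per-direction
    weights, following the equations above (diagonal directions omitted;
    circular boundary handling via torch.roll is used for brevity)."""
    def __init__(self, rounds: int = 30, k: float = 0.1, lam: float = 0.2):
        super().__init__()
        self.rounds, self.k, self.lam = rounds, k, lam
        self.w = nn.Parameter(torch.ones(4))  # trainable weights for N, S, E, W

    def forward(self, img: torch.Tensor) -> torch.Tensor:  # img: (N, C, H, W)
        for _ in range(self.rounds):
            diffs = [torch.roll(img, s, dims=d) - img        # finite differences
                     for s, d in ((1, 2), (-1, 2), (1, 3), (-1, 3))]
            cond = [torch.exp(-(w * g) ** 2 / self.k ** 2)   # thermal conductivities
                    for w, g in zip(self.w, diffs)]
            img = img + self.lam * sum(c * g for c, g in zip(cond, diffs))
        return img

denoised = TrainableAnisotropicDiffusion()(torch.rand(1, 1, 224, 224))
```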

4. Experiments

In this section, we will describe the experimental setup, as well as the datasets we have used. Our proposed MS-UNet is compared with several benchmarks on three public medical image segmentation datasets, including CT, MRI, and X-ray datasets. Finally, we have conducted an extensive ablation study on MS-UNet.

4.1. Datasets

4.1.1. Synapse

The Synapse multi-organ segmentation dataset [35] includes 30 abdominal CT scans with 3779 axial contrast-enhanced abdominal clinical CT images. Following the literature [10,35], we randomly split the dataset into 18 training cases and 12 testing cases and evaluated our method on eight abdominal organs (aorta, gallbladder, left kidney, right kidney, liver, pancreas, spleen, and stomach) using the average Dice similarity coefficient (DSC) and the average Hausdorff distance (HD) as evaluation metrics.

4.1.2. ACDC

The automated cardiac diagnosis challenge dataset [36] is an open-competition cardiac MRI dataset containing left ventricular (LV), right ventricular (RV), and myocardial (MYO) labels. The dataset contains 100 samples, which we divided into 70 training samples, 10 validation samples, and 20 test samples for consistency with [37].

4.1.3. JSRT

The Japan Society of Radiological Technology dataset [38] provides 247 X-ray images of size 1024 × 1024, containing lung, heart, and clavicle labels. In this study, we scaled the images to 224 × 224 by interpolation and only experimented with lung segmentation. The JSRT dataset is randomly divided into three independent sets: 172 images for training, 25 for validation, and 50 for testing.

4.2. Implementation Details

All experiments were conducted on an RTX3090 GPU (Nvidia, Santa Clara, CA, USA). The Swin Transformer backbone in MS-UNet was pretrained on ImageNet [39]. Input images were resized to 224 × 224, with a patch size of 4 × 4. During training, the batch size was set to 24 for all datasets, and we used the SGD optimizer with a momentum of 0.9, a weight decay of 1 × 10⁻⁴, and a learning rate of 0.05 for Synapse and 0.0001 for ACDC and JSRT. We trained on Synapse and ACDC for 250 epochs and on JSRT for 500 epochs. For the weights in Equation (1), we set $\omega_1$ and $\omega_2$ to 0.5 and $\omega_3$ to 0.1. For the second training stage with the PAD module, we froze the parameters of the primary model and trained only the PAD module, for 100 epochs on Synapse and ACDC and 200 epochs on JSRT.
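For concreteness, the Synapse training configuration described above maps to the following sketch; the stand-in model and the randomly generated loss terms are placeholders, not the actual MS-UNet pipeline.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(1, 9, kernel_size=3, padding=1)  # placeholder, not the actual MS-UNet
optimizer = torch.optim.SGD(model.parameters(), lr=0.05,   # lr = 0.0001 for ACDC and JSRT
                            momentum=0.9, weight_decay=1e-4)

w1, w2, w3 = 0.5, 0.5, 0.1             # loss weights of Equation (1)
ce, dsc, edge = torch.rand(3)          # stand-ins for the three loss terms
total_loss = w1 * ce + w2 * dsc + w3 * edge
```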
For evaluation, we employed two metrics: the Dice similarity coefficient and the Hausdorff distance. The Dice similarity coefficient measures segmentation similarity, as detailed in Section 3.2.2. The Hausdorff distance quantifies the similarity between two sets of points by calculating their distance. Suppose there are two point sets $A = \{a_1, a_2, \ldots, a_n\}$ and $B = \{b_1, b_2, \ldots, b_n\}$; the Hausdorff distance between them is defined as
$$H(A, B) = \max(h(A, B), h(B, A)),$$

$$h(A, B) = \max_{a \in A} \min_{b \in B} \|a - b\|,$$

$$h(B, A) = \max_{b \in B} \min_{a \in A} \|b - a\|,$$

where $\| \cdot \|$ denotes the Euclidean distance between elements.
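The following sketch computes the symmetric Hausdorff distance between two point sets directly from this definition.

```python
import torch

def hausdorff_distance(A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    """Symmetric Hausdorff distance between point sets A (n, d) and B (m, d)."""
    d = torch.cdist(A, B)              # (n, m) pairwise Euclidean distances
    h_ab = d.min(dim=1).values.max()   # h(A, B) = max_a min_b ||a - b||
    h_ba = d.min(dim=0).values.max()   # h(B, A) = max_b min_a ||b - a||
    return torch.max(h_ab, h_ba)

H = hausdorff_distance(torch.rand(5, 2), torch.rand(7, 2))
```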

4.3. Experimental Results

Table 1, Table 2 and Table 3 show the segmentation results of MS-UNet and other up-to-date methods on the Synapse multi-organ CT dataset, ACDC dataset, and JSRT dataset. From the results, it can be seen that the proposed MS-UNet exhibits overall better performance than the other methods. In particular, we conducted an experiment to compare the performances of different methods when only using part of the training data.

4.3.1. Performance on Synapse

In the Synapse dataset, we used three open-source U-like family models and four results from [10] as baselines for a quantitative comparison with our MS-UNet model. Table 1 presents the quantitative comparison results, consistently showing MS-UNet outperforming the baselines. Notably, when compared to SwinUnet [11], MS-UNet achieves a significant improvement of 1.52 % in DSC and 3.74 in HD.
Furthermore, when comparing MS-UNet to other baseline models in Figure 4, it exhibits quantitative improvements in most organ segmentation metrics. By incorporating our proposed ELoss into MS-UNet, we achieve additional improvements of 0.44 % in DSC and 0.91 in HD. Introducing the PAD module in the trained MS-UNet model further enhances results, with a DSC improvement of 0.94 % and an HD improvement of 2.82 . The visualized outcomes can be seen in Figure 5. Significantly, our method demonstrates substantial advancements in kidney organ segmentation. Specifically, Kidney(L) segmentation improves by 4.37 % compared to UNet [2], and Kidney(R) segmentation achieves a notable enhancement of 5.07 % compared to SwinUnet [11].
These quantitative results affirm the effectiveness of our MS-UNet model in accurately segmenting organs in the Synapse dataset. Integrating ELoss and the plug-and-play adaptive denoising module further enhances performance, especially in kidney organ segmentation. These advancements highlight the potential of our method to improve medical image segmentation tasks and contribute to enhanced clinical decision-making processes.

4.3.2. Performance on ACDC

The quantitative analysis in Table 2 demonstrates the superiority of our approach over baseline methods across all evaluation metrics for the ACDC dataset. Our proposed method achieves a mean DSC of 91.01 ± 0.09 % and a mean HD of 1.21 ± 0.05. This includes specific DSC scores of 89.11 % for right ventricular segmentation, 88.50 % for myocardial segmentation, and 95.41 % for left ventricular segmentation. It comprehensively exceeds the SwinUnet scores on the dataset. Incorporating ELoss into MS-UNet leads to further improvements, with the DSC score increasing to 91.08 ± 0.04 %, the HD value becoming 1.24 ± 0.04, and individual segmentation categories showing DSC scores of 89.10 % for right ventricular, 88.57 % for myocardial, and 95.58 % for left ventricular segmentation. Similar to its performance on the Synapse dataset, the integration of our PAD module enhances segmentation performance on the ACDC dataset. With this module, our model achieves a DSC score of 91.26 ± 0.04 % and an HD value of 1.15 ± 0.01.
These results underscore the effectiveness of our proposed method in achieving high-quality segmentation results on the ACDC dataset. The integration of ELoss and the PAD module further improves segmentation performance, highlighting its potential for accurate and reliable medical image segmentation in MRI data.

4.3.3. Performance on JSRT

In Table 3, we present the segmentation results obtained using our proposed method alongside other baseline methods. Our MS-UNet achieves impressive segmentation performance, with a DSC of 98.0052 ± 0.0108 % and an HD of 2.3756 ± 0.0357 . These results highlight the superior performance of our method compared to other transformer-based methods, showcasing the efficacy of the enhanced network structure in MS-UNet on the JSRT dataset. Moreover, when incorporating our proposed ELoss, the segmentation performance of MS-UNet becomes comparable to that of UNet, leveraging the advantages of convolutional neural networks for binary segmentation tasks with limited data. When using ELoss, MS-UNet achieves a DSC of 98.0219 ± 0.0074 % and an HD of 2.3586 ± 0.0282 . Lastly, the introduction of the denoising module to the MS-UNet elevates our approach beyond UNet, leading to further improvements in segmentation performance. We report a DSC of 98.0543 ± 0.0113 % and an HD of 2.2698 ± 0.0295 .
These results underscore the effectiveness of our proposed method in achieving accurate and reliable segmentation results on the JSRT dataset. The integration of the ELoss and the denoising module enhances segmentation performance, positioning our approach as a competitive method for medical image segmentation on X-ray images.

4.4. Ablation Study on Training Data Volume

In order to illustrate the learning efficiency of MS-UNet, ELoss, and the PAD module on few-shot datasets, we conducted an ablation study comparing the segmentation performance of MS-UNet with other baseline models under varying training data volumes. The results of this study are presented in Figure 6. We randomly selected training data at predefined ratios and trained the models accordingly. Subsequently, we evaluated the segmentation performance of each trained model using the same test data.

4.4.1. Performance of MS-UNet

On the Synapse dataset, MS-UNet achieved an impressive DSC of 78.41 % when using only half the training data, showcasing its efficient semantic information extraction and surpassing UNet and TransUNet, which scored 78.23 % and 77.93 % , respectively, when using the full training data. SwinUnet achieved a DSC of 77.05 % when using 75 % of the data.
Similar trends were observed on the ACDC dataset, where MS-UNet consistently outperformed the baseline models, even with fewer training data. On the JSRT dataset, UNet performed best when using the complete training data, achieving a DSC of 98.03 % , slightly above MS-UNet but superior to other transformer-based models. As training data decreased, the advantage of pretrained transformer-based models became evident. UNet’s performance degraded significantly when using limited data, while MS-UNet remained stable, even when using minimal training data. In an extreme scenario with only one-sixteenth of the data, MS-UNet outperformed other models with a remarkable DSC of 96.40 % . In comparison, TransUNet scored 94.79 % DSC, SwinUnet scored a 93.98 % DSC, and UNet scored an 89.96 % DSC.
These results highlight the effectiveness of MS-UNet to capture semantic information through its multi-scale nested decoder, making it particularly valuable in medical image processing, where labeled data are often scarce. By efficiently utilizing available data and maintaining segmentation performance stability, MS-UNet addresses challenges related to limited labeled data in medical imaging.

4.4.2. Performance of ELoss and the PAD Module

We assessed the impact of ELoss and the PAD module on the performance of MS-UNet under limited training data on three datasets (Table 4, Table 5 and Table 6).
In the Synapse dataset, the addition of the ELoss and PAD modules significantly improved the segmentation performance of MS-UNet, even with fewer training data. With just one-fourth of the training data, MS-UNet with the PAD module achieved remarkable results, yielding a DSC of 78.17 ± 0.09 % and an HD of 24.56 ± 1.06 . These results outperformed UNet and TransUNet trained on the entire dataset and approached the performance of SwinUnet. Similar improvements were observed in both the ACDC and JSRT datasets when ELoss and the PAD module were employed. In the most extreme case, with only one-sixteenth of the training data, the PAD module reduced the HD from 2.26 ± 0.63 to 1.45 ± 0.03 in the ACDC dataset and from 6.28 ± 0.21 to 3.81 ± 0.14 in the JSRT dataset.
These results validate the effectiveness of our proposed ELoss and PAD modules for medical image segmentation tasks. The PAD module’s ability to mitigate the impact of image noise on segmentation performance, even with limited training data, highlights its practical utility in challenging scenarios.

4.5. Ablation Study on ELoss and PAD Module in Other Models

We assessed the versatility and effectiveness of our ELoss and PAD modules through ablation experiments on UNet, TransUNet, and SwinUnet using the Synapse dataset. The results in Table 7 demonstrate their effectiveness in improving segmentation performance.
UNet improved from 77.22 ± 0.41 % to 78.42 ± 0.09 % in DSC and reduced the HD from 29.31 ± 0.85 to 29.25 ± 1.78 when using the ELoss and PAD modules. TransUNet also benefited, with DSC increasing from 77.07 ± 0.34 % to 78.56 ± 0.44 % and HD decreasing from 31.41 ± 1.27 to 31.27 ± 1.42 . SwinUnet showed a significant HD improvement, from 25.69 ± 1.28 to 21.41 ± 1.80 , and a DSC increase from 78.45 ± 0.49 % to 79.22 ± 0.15 % .
These modules proved versatile across various models. Even with limited data, a second experiment with SwinUnet on the Synapse dataset (Table 4) showed improvements. With only one-eighth of the training data, SwinUnet with these modules achieved a DSC of 74.10 ± 0.15 % and an HD of 27.17 ± 1.11 , approaching the performance of SwinUnet with one-quarter of the training data ( 74.40 ± 0.61 % DSC and 28.01 ± 2.31 HD).
In summary, our ELoss and PAD modules are versatile and effective across different models, making them valuable tools for accurate medical image segmentation tasks in both regular and low-data scenarios.

4.6. Ablation Study on MS-UNet Architectures

In this part, we explore the impact of the number and location of multi-scale nested decoders on the segmentation performance of MS-UNet by changing them gradually, as shown in Figure 7. From the experimental results in Table 8, even when adding only one multi-scale nested block in a shallow layer of the network, the DSC of the model improved from 78.49 ± 0.49 % to 79.57 ± 0.28 %, while the number of parameters increased by only 1 %. This suggests that introducing multi-scale nested decoders enhances the model’s ability to capture and utilize semantic information, leading to improved segmentation results. Moreover, as the number of multi-scale nested blocks increases, the segmentation performance of the model continues to improve without overfitting. This observation is particularly important because it demonstrates that the benefits of the multi-scale nested decoder structure are not limited to a specific number or location of the blocks. The model can effectively leverage multiple nested decoders to bring the feature maps between the encoder and decoder semantically closer, resulting in better segmentation performance.
These findings further support the notion that the structure of MS-UNet with multi-scale nested decoders enables the network to capture and utilize semantic information more effectively. By semantically connecting the encoder and decoder components, the model could better exploit the hierarchical relationships within the feature maps, leading to improved segmentation performance.

4.7. Ablation Study on ELoss Setting

In Equation (1), both ω 1 and ω 2 were fixed at 0.5, emphasizing the pivotal roles of the cross-entropy loss and the Dice similarity coefficient loss for precise semantic segmentation. In order to determine the optimal weight ω 3 for ELoss, we conducted ablation experiments, and the quantitative results are shown in Table 9.
Integrating ELoss with a weight ( ω 3 ) of 0.1 into the MS-UNet notably enhanced segmentation performance. The DSC improved from an initial 79.97 ± 0.20 % to 80.41 ± 0.27 % , and the HD decreased from 21.95 ± 2.11 to 21.04 ± 0.72 , highlighting the contribution of ELoss to improved accuracy. However, increasing ω 3 beyond a threshold led to a decline in segmentation performance, even falling below the performance of MS-UNet without ELoss. This underscores the continued dominance of the cross-entropy loss and DSC loss in overall model training. Finding the optimal weight for ELoss is crucial for maintaining and enhancing segmentation performance.
The timing of introducing ELoss during training also influenced segmentation performance. A series of ablation experiments (for the parameters, see Section 4.2) introduced ELoss at different epochs (0th, 50th, and 100th). The results in Table 10 show that introducing ELoss from the start led to a slight improvement, with DSC increasing from 79.97 ± 0.20 % to 80.07 ± 0.38 % . Optimal performance occurred when introduced at the 50th epoch, with a DSC of 80.41 ± 0.27 % and an HD of 21.04 ± 0.72 . Delaying ELoss introduction led to a decline in performance, emphasizing its importance at the right training stage.
In summary, these experiments highlight the critical roles of cross-entropy loss and DSC Loss for accurate segmentation, with ELoss as a supplementary signal. Finding the right ELoss weight and introduction timing is vital for maximizing the segmentation performance of MS-UNet.

4.8. Discussion and Future Work

In our study, we applied the MS-UNet, ELoss, and PAD modules to medical image segmentation tasks, consistently outperforming other models, even with limited training data. This addresses a common challenge in medical image analysis, where labeled data are scarce and costly to acquire. MS-UNet offers the advantage of being adaptable to various medical image processing tasks while requiring smaller data scales. Furthermore, our proposed ELoss and PAD module, when integrated into the training and post-processing stages, significantly improved segmentation performance. However, this paper studies only three imaging modalities (CT, MRI, and X-ray), so we cannot guarantee that the proposed MS-UNet would achieve equally strong results on other types of medical image segmentation tasks.
In future research, we plan to further optimize these modules to enhance their effectiveness in improving segmentation performance. In particular, the multi-scale nested decoder currently combines features of different dimensions by stacking multiple Swin Transformer blocks, which increases the number of model parameters; a more streamlined structure could reduce the parameter count and improve the efficiency of feature fusion. We aim to explore the applicability of these modules to other models, extending their benefits to a wider range of segmentation tasks. While we currently focus on 2D segmentation tasks with MS-UNet, we intend to extend it to 3D segmentation. We also plan to investigate pretraining transformer-based models on unlabeled medical images using self-supervised learning, potentially combining them with our proposed methods. Furthermore, we are exploring interpretable deep learning techniques inspired by causal inference and white-box transformers for medical image processing tasks [41,42].

5. Conclusions

In this work, a novel transformer-based U-shaped medical image segmentation network called MS-UNet has been proposed. MS-UNet addresses the challenge of bridging the semantic gap between the encoder and decoder components by introducing a multi-scale nested decoder structure. This design enables a tighter semantic hierarchy and improves the stability and generalization of the model. The experimental results have demonstrated the excellent performance and efficient feature learning ability of MS-UNet. Notably, MS-UNet outperforms other U-Net family models even in scenarios with a very limited amount of training data, a significant contribution considering the scarcity and costliness of labeled medical images. Additionally, this work introduces a novel ELoss and a companion edge-label-generation method, which collectively enhance the segmentation performance of MS-UNet. ELoss emphasizes the importance of edge information, further improving the accuracy of the segmentation results. Furthermore, a plug-and-play adaptive denoising (PAD) module is proposed, which can be utilized not only with MS-UNet but also with other segmentation models. This module effectively reduces the impact of noise when fine-tuning an already-trained model, leading to improved segmentation performance.

Author Contributions

Conceptualization, H.C. and K.L.; Methodology, H.C., Y.H., X.W. and K.L.; Software, H.C.; Validation, H.C.; Formal analysis, H.C. and Y.H.; Investigation, H.C. and L.Y.; Resources, K.L. and J.Y.; Data curation, H.C.; Writing—original draft, H.C.; Writing—review & editing, X.W. and K.L.; Visualization, H.C.; Supervision, K.L. and J.Y.; Project administration, K.L. and J.Y.; Funding acquisition, K.L. and J.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Key Research and Development Program of China under Grant 2022YFF0606303, the National Natural Science Foundation of China under Grant 62206054, and the Research Capacity Enhancement Project of Key Construction Discipline in Guangdong Province under Grant 2022ZDJS028.

Data Availability Statement

The code is publicly available at: https://github.com/HH446/MS-UNet (accessed on 1 September 2024).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Shamshad, F.; Khan, S.; Zamir, S.W.; Khan, M.H.; Hayat, M.; Khan, F.S.; Fu, H. Transformers in medical imaging: A survey. Med. Image Anal. 2023, 88, 102802. [Google Scholar] [CrossRef]
  2. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015, Proceedings, Part III; Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  3. Zhou, Z.; Rahman Siddiquee, M.M.; Tajbakhsh, N.; Liang, J. Unet++: A nested u-net architecture for medical image segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, 20 September 2018, Proceedings; Springer: Cham, Switzerland, 2018; pp. 3–11. [Google Scholar]
  4. Xiao, X.; Lian, S.; Luo, Z.; Li, S. Weighted res-unet for high-quality retina vessel segmentation. In Proceedings of the 2018 9th International Conference on Information Technology in Medicine and Education (ITME), Hangzhou, China, 19–21 October 2018; pp. 327–331. [Google Scholar]
  5. Huang, H.; Lin, L.; Tong, R.; Hu, H.; Zhang, Q.; Iwamoto, Y.; Han, X.; Chen, Y.W.; Wu, J. Unet 3+: A full-scale connected unet for medical image segmentation. In Proceedings of the ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 1055–1059. [Google Scholar]
  6. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  7. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  8. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
  9. Yu, F.; Wang, D.; Shelhamer, E.; Darrell, T. Deep layer aggregation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2403–2412. [Google Scholar]
  10. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. Transunet: Transformers make strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar]
  11. Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-unet: Unet-like pure transformer for medical image segmentation. In Computer Vision—ECCV 2022 Workshops; Springer: Cham, Switzerland, 2022; pp. 205–218. [Google Scholar]
  12. Jha, D.; Riegler, M.A.; Johansen, D.; Halvorsen, P.; Johansen, H.D. Doubleu-net: A deep convolutional neural network for medical image segmentation. In Proceedings of the 2020 IEEE 33rd International Symposium on Computer-Based Medical Systems (CBMS), Rochester, MN, USA, 28–30 July 2020; pp. 558–564. [Google Scholar]
  13. Lin, A.; Chen, B.; Xu, J.; Zhang, Z.; Lu, G.; Zhang, D. Ds-transunet: Dual swin transformer u-net for medical image segmentation. IEEE Trans. Instrum. Meas. 2022, 71, 4005615. [Google Scholar] [CrossRef]
  14. Gu, R.; Wang, L.; Zhang, L. DE-Net: A deep edge network with boundary information for automatic skin lesion segmentation. Neurocomputing 2022, 468, 71–84. [Google Scholar] [CrossRef]
  15. Milletari, F.; Navab, N.; Ahmadi, S.A. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; pp. 565–571. [Google Scholar]
  16. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  17. Lv, J.; Hu, Y.; Fu, Q.; Zhang, Z.; Hu, Y.; Lv, L.; Yang, G.; Li, J.; Zhao, Y. CM-MLP: Cascade Multi-scale MLP with Axial Context Relation Encoder for Edge Segmentation of Medical Image. In Proceedings of the 2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Las Vegas, NV, USA, 6–8 December 2022; pp. 1100–1107. [Google Scholar]
  18. Fan, D.P.; Ji, G.P.; Zhou, T.; Chen, G.; Fu, H.; Shen, J.; Shao, L. Pranet: Parallel reverse attention network for polyp segmentation. In Medical Image Computing and Computer Assisted Intervention—MICCAI 2020; Springer: Cham, Switzerland, 2020; pp. 263–273. [Google Scholar]
  19. Lee, H.J.; Kim, J.U.; Lee, S.; Kim, H.G.; Ro, Y.M. Structure boundary preserving segmentation for medical image with ambiguous boundary. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4817–4826. [Google Scholar]
  20. Ma, S.; Li, X.; Tang, J.; Guo, F. EAA-Net: Rethinking the Autoencoder Architecture with Intra-class Features for Medical Image Segmentation. arXiv 2022, arXiv:2208.09197. [Google Scholar]
  21. Hatamizadeh, A.; Terzopoulos, D.; Myronenko, A. End-to-end boundary aware networks for medical image segmentation. In Machine Learning in Medical Imaging: 10th International Workshop, MLMI 2019, Held in Conjunction with MICCAI 2019, Shenzhen, China, October 13, 2019, Proceedings; Springer: Cham, Switzerland, 2019; pp. 187–194. [Google Scholar]
  22. Kuang, H.; Liang, Y.; Liu, N.; Liu, J.; Wang, J. BEA-SegNet: Body and edge aware network for medical image segmentation. In Proceedings of the 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Houston, TX, USA, 9–12 December 2021; pp. 939–944. [Google Scholar]
  23. Han, H.Y.; Chen, Y.C.; Hsiao, P.Y.; Fu, L.C. Using channel-wise attention for deep CNN based real-time semantic segmentation with class-aware edge information. IEEE Trans. Intell. Transp. Syst. 2020, 22, 1041–1051. [Google Scholar] [CrossRef]
  24. Li, J.; Zhu, G.; Hua, C.; Feng, M.; Li, P.; Lu, X.; Song, J.; Shen, P.; Xu, X.; Mei, L.; et al. A systematic collection of medical image datasets for deep learning. arXiv 2021, arXiv:2106.12864. [Google Scholar] [CrossRef]
  25. Sagheer, S.V.M.; George, S.N. A review on medical image denoising algorithms. Biomed. Signal Process. Control 2020, 61, 102036. [Google Scholar]
  26. Goyal, B.; Dogra, A.; Agrawal, S.; Sohi, B.S.; Sharma, A. Image denoising review: From classical to state-of-the-art approaches. Inf. Fusion 2020, 55, 220–244. [Google Scholar] [CrossRef]
  27. Fan, L.; Zhang, F.; Fan, H.; Zhang, C. Brief review of image denoising techniques. Vis. Comput. Ind. Biomed. Art 2019, 2, 7. [Google Scholar] [CrossRef] [PubMed]
  28. Yang, T.; Zhu, Y.; Xie, Y.; Zhang, A.; Chen, C.; Li, M. Aim: Adapting image models for efficient video action recognition. arXiv 2023, arXiv:2302.03024. [Google Scholar]
  29. Jia, M.; Tang, L.; Chen, B.C.; Cardie, C.; Belongie, S.; Hariharan, B.; Lim, S.N. Visual prompt tuning. In Computer Vision—ECCV 2022; Springer: Cham, Switzerland, 2022; pp. 709–727. [Google Scholar]
  30. Liang, T.; Jin, Y.; Li, Y.; Wang, T. Edcnn: Edge enhancement-based densely connected network with compound loss for low-dose ct denoising. In Proceedings of the 2020 15th IEEE International Conference on Signal Processing (ICSP), Beijing, China, 6–9 December 2020; Volume 1, pp. 193–198. [Google Scholar]
  31. Luthra, A.; Sulakhe, H.; Mittal, T.; Iyer, A.; Yadav, S. Eformer: Edge enhancement based transformer for medical image denoising. arXiv 2021, arXiv:2109.08044. [Google Scholar]
  32. Houlsby, N.; Giurgiu, A.; Jastrzebski, S.; Morrone, B.; De Laroussilhe, Q.; Gesmundo, A.; Attariyan, M.; Gelly, S. Parameter-efficient transfer learning for NLP. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 2790–2799. [Google Scholar]
  33. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  34. Chen, H.; Han, Y.; Li, Y.; Xu, P.; Li, K.; Yin, J. MS-UNet: Swin Transformer U-Net with Multi-scale Nested Decoder for Medical Image Segmentation with Small Training Data. In Proceedings of the Pattern Recognition and Computer Vision, Xiamen, China, 13–15 October 2023. [Google Scholar]
  35. Fu, S.; Lu, Y.; Wang, Y.; Zhou, Y.; Shen, W.; Fishman, E.; Yuille, A. Domain adaptive relational reasoning for 3d multi-organ segmentation. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2020: 23rd International Conference, Lima, Peru, October 4–8, 2020, Proceedings, Part I; Springer: Cham, Switzerland, 2020; pp. 656–666. [Google Scholar]
  36. Bernard, O.; Lalande, A.; Zotti, C.; Cervenansky, F.; Yang, X.; Heng, P.A.; Cetin, I.; Lekadir, K.; Camara, O.; Ballester, M.A.G.; et al. Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: Is the problem solved? IEEE Trans. Med Imaging 2018, 37, 2514–2525. [Google Scholar] [CrossRef] [PubMed]
  37. Wang, H.; Xie, S.; Lin, L.; Iwamoto, Y.; Han, X.H.; Chen, Y.W.; Tong, R. Mixed transformer u-net for medical image segmentation. In Proceedings of the ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Virtual, 7–13 May 2022; pp. 2390–2394. [Google Scholar]
  38. Shiraishi, J.; Katsuragawa, S.; Ikezoe, J.; Matsumoto, T.; Kobayashi, T.; Komatsu, K.i.; Matsui, M.; Fujita, H.; Kodera, Y.; Doi, K. Development of a digital image database for chest radiographs with and without a lung nodule: Receiver operating characteristic analysis of radiologists’ detection of pulmonary nodules. Am. J. Roentgenol. 2000, 174, 71–74. [Google Scholar] [CrossRef] [PubMed]
  39. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
  40. Oktay, O.; Schlemper, J.; Folgoc, L.L.; Lee, M.; Heinrich, M.; Misawa, K.; Mori, K.; McDonagh, S.; Hammerla, N.Y.; Kainz, B.; et al. Attention u-net: Learning where to look for the pancreas. arXiv 2018, arXiv:1804.03999. [Google Scholar]
  41. Guo, R.; Cheng, L.; Li, J.; Hahn, P.R.; Liu, H. A survey of learning causality with data: Problems and methods. ACM Comput. Surv. (CSUR) 2020, 53, 75. [Google Scholar] [CrossRef]
  42. Yu, Y.; Chu, T.; Tong, S.; Wu, Z.; Pai, D.; Buchanan, S.; Ma, Y. Emergence of Segmentation with Minimalistic White-Box Transformers. arXiv 2023, arXiv:2308.16271. [Google Scholar]
Figure 1. The architecture of MS-UNet and the plug-and-play adaptive denoising (PAD) module. Our contributions: ❶ The MS-UNet is composed of pure Transformer modules; ❷ Multi-Scale Nested Decoder; ❸ the plug-and-play adaptive denoising (PAD) module consists of two trainable denoising modules.
Figure 2. The calculation process for the edge loss score in ELoss. The result of the model prediction is multiplied element by element with the edge ground truth generated by the edge label generation method (introduced in Section 3.3) to obtain the corresponding edge loss score.
Figure 3. The architecture of the plug-and-play adaptive denoising module. The module generates noise reduction maps from two different trainable denoising modules and computes them with the input image to obtain the final noise reduction image as model input.
Figure 4. Visualization of the segmentation effects of different models on the Synapse multi-organ CT dataset. Among them, our proposed MS-UNet outperforms other baseline models in both global and detailed segmentation.
Figure 5. Visualization of the segmentation effects of different combinations of methods on the Synapse multi-organ CT dataset. (b) is the segmentation result with MS-UNet only. (c) is the segmentation result with MS-UNet and ELoss. (d) is the segmentation result with MS-UNet, ELoss, and the plug-and-play adaptive denoising module.
Figure 6. The DSC and HD scores of MS-UNet and typical baseline models trained with different ratios of the Synapse, ACDC, and JSRT training data show that our proposed MS-UNet achieves the best segmentation performance, even in extreme cases using only part of the training data. Higher DSC scores (line graphs) are better, and lower HD scores (bar graphs) are better.
Figure 7. The architecture of MS-UNet with different numbers and locations of the multi-scale nested decoder blocks.
Table 1. The average Dice similarity coefficient (DSC) and the average Hausdorff distance (HD) of different methods on the complete Synapse multi-organ CT dataset.
| Methods | DSC | HD | Aorta | Gallbladder | Kidney (L) | Kidney (R) | Liver | Pancreas | Spleen | Stomach |
|---|---|---|---|---|---|---|---|---|---|---|
| R50 U-Net [10] | 74.68 | 36.87 | 87.74 | 63.66 | 80.60 | 78.19 | 93.74 | 56.90 | 85.87 | 74.16 |
| R50 Att-UNet [10] | 75.57 | 36.97 | 55.92 | 63.91 | 79.20 | 72.71 | 93.56 | 49.37 | 87.19 | 74.95 |
| Att-UNet [40] | 77.77 | 36.02 | 89.55 | 68.88 | 77.98 | 71.11 | 93.57 | 58.04 | 87.30 | 75.75 |
| UNet [2] | 77.22 ± 0.41 | 29.31 ± 0.85 | 87.79 | 65.10 | 83.26 | 76.69 | 94.05 | 52.50 | 86.57 | 71.79 |
| R50 ViT [10] | 71.29 | 32.87 | 73.73 | 55.13 | 75.80 | 72.20 | 91.51 | 45.99 | 81.99 | 73.95 |
| MT-UNet [37] | 78.59 | 26.59 | 87.92 | 64.99 | 81.47 | 77.29 | 93.06 | 59.46 | 87.75 | 76.81 |
| TransUNet [10] | 77.07 ± 0.34 | 31.41 ± 1.27 | 87.07 | 61.46 | 81.47 | 75.39 | 94.10 | 55.83 | 85.24 | 77.66 |
| SwinUnet [11] | 78.45 ± 0.49 | 25.69 ± 1.28 | 85.87 | 66.62 | 82.31 | 78.95 | 93.87 | 56.13 | 89.00 | 74.84 |
| MS-UNet | 79.97 ± 0.20 | 21.95 ± 2.11 | 85.33 | 67.89 | 84.86 | 81.24 | 94.30 | 58.24 | 90.14 | 77.76 |
| MS-UNet + ELoss | 80.41 ± 0.27 | 21.04 ± 0.72 | 85.25 | 68.18 | 86.29 | 82.30 | 94.23 | 59.95 | 90.08 | 77.00 |
| MS-UNet + PAD Module | 81.35 ± 0.02 | 18.22 ± 0.05 | 86.33 | 67.82 | 87.63 | 84.02 | 94.52 | 60.77 | 90.66 | 79.05 |
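For reference, the two metrics reported in Tables 1–10 can be computed per class roughly as follows. This is a hedged NumPy/SciPy sketch, not the paper's evaluation code, which may, for example, compute HD on surface points only or use a 95th-percentile variant.

```python
# A minimal sketch of DSC and HD on binary masks (assumes non-empty masks).
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dice_score(pred: np.ndarray, gt: np.ndarray) -> float:
    """Dice similarity coefficient for binary masks (higher is better)."""
    intersection = np.logical_and(pred, gt).sum()
    denom = pred.sum() + gt.sum()
    return 2.0 * intersection / denom if denom > 0 else 1.0

def hausdorff_distance(pred: np.ndarray, gt: np.ndarray) -> float:
    """Symmetric Hausdorff distance (lower is better), computed here over
    all foreground pixel coordinates for simplicity."""
    p_pts = np.argwhere(pred)  # (N, 2) foreground coordinates
    g_pts = np.argwhere(gt)
    return max(directed_hausdorff(p_pts, g_pts)[0],
               directed_hausdorff(g_pts, p_pts)[0])
```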
Table 2. The average Dice similarity coefficient (DSC) and the average Hausdorff distance (HD) of different methods on the ACDC dataset.
| Methods | DSC | HD | RV | Myo | LV |
|---|---|---|---|---|---|
| U-Net [2] | 89.61 ± 0.42 | 2.93 ± 0.33 | 85.85 | 87.96 | 95.02 |
| TransUNet [10] | 87.29 ± 0.16 | 2.21 ± 0.19 | 83.44 | 84.66 | 93.77 |
| SwinUnet [11] | 90.57 ± 0.10 | 1.39 ± 0.16 | 88.28 | 88.16 | 95.27 |
| MS-UNet | 91.01 ± 0.09 | 1.21 ± 0.05 | 89.11 | 88.50 | 95.41 |
| MS-UNet + ELoss | 91.08 ± 0.04 | 1.24 ± 0.04 | 89.10 | 88.57 | 95.58 |
| MS-UNet + PAD Module | 91.26 ± 0.04 | 1.15 ± 0.01 | 89.71 | 88.66 | 95.42 |
Table 3. The average Dice similarity coefficient (DSC) and the average Hausdorff distance (HD) of different methods on the JSRT dataset.
| Methods | DSC | HD |
|---|---|---|
| U-Net [2] | 98.0234 ± 0.0271 | 2.5822 ± 0.0876 |
| TransUNet [10] | 97.9181 ± 0.0422 | 2.4609 ± 0.0646 |
| SwinUnet [11] | 97.9218 ± 0.0145 | 2.4662 ± 0.0426 |
| MS-UNet | 98.0052 ± 0.0108 | 2.3756 ± 0.0357 |
| MS-UNet + ELoss | 98.0219 ± 0.0074 | 2.3586 ± 0.0282 |
| MS-UNet + PAD Module | 98.0543 ± 0.0113 | 2.2698 ± 0.0295 |
Table 4. The average Dice similarity coefficient (DSC) and the average Hausdorff distance (HD) of MS-UNet and SwinUnet with the ELoss and PAD modules on the partial Synapse multi-organ CT dataset.
| Volume | Methods | DSC | HD | Aorta | Gallbladder | Kidney (L) | Kidney (R) | Liver | Pancreas | Spleen | Stomach |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1/4 | SwinUnet | 74.40 ± 0.62 | 28.01 ± 2.31 | 81.54 | 60.74 | 80.41 | 71.65 | 92.57 | 51.31 | 85.31 | 71.70 |
| 1/4 | SwinUnet + ELoss | 75.31 ± 0.31 | 27.44 ± 1.21 | 83.28 | 62.12 | 80.94 | 72.75 | 92.35 | 52.80 | 85.48 | 72.77 |
| 1/4 | SwinUnet + PAD Module | 76.38 ± 0.20 | 30.11 ± 0.78 | 84.06 | 63.52 | 82.31 | 74.67 | 93.13 | 53.82 | 86.36 | 73.22 |
| 1/4 | MS-UNet | 76.23 ± 0.23 | 26.19 ± 0.82 | 83.34 | 61.29 | 84.19 | 77.26 | 92.97 | 52.94 | 85.58 | 72.27 |
| 1/4 | MS-UNet + ELoss | 76.63 ± 0.32 | 26.37 ± 1.33 | 82.47 | 62.95 | 84.73 | 78.08 | 92.69 | 53.65 | 85.56 | 72.95 |
| 1/4 | MS-UNet + PAD Module | 78.17 ± 0.09 | 24.56 ± 1.06 | 84.32 | 64.33 | 85.76 | 79.61 | 93.54 | 56.85 | 86.93 | 74.01 |
| 1/8 | SwinUnet | 69.51 ± 0.46 | 33.42 ± 1.40 | 74.19 | 57.63 | 76.13 | 70.43 | 91.37 | 45.59 | 79.57 | 61.17 |
| 1/8 | SwinUnet + ELoss | 71.94 ± 0.43 | 29.86 ± 1.85 | 77.00 | 58.52 | 80.36 | 74.67 | 91.49 | 47.31 | 81.10 | 65.06 |
| 1/8 | SwinUnet + PAD Module | 74.10 ± 0.15 | 27.17 ± 1.11 | 80.59 | 57.72 | 84.43 | 77.90 | 92.39 | 48.34 | 82.99 | 68.40 |
| 1/8 | MS-UNet | 72.26 ± 0.39 | 27.01 ± 1.65 | 77.81 | 59.20 | 81.51 | 75.56 | 91.87 | 46.01 | 81.34 | 64.81 |
| 1/8 | MS-UNet + ELoss | 72.74 ± 0.16 | 27.69 ± 0.86 | 77.29 | 61.32 | 81.35 | 75.09 | 91.39 | 47.00 | 81.12 | 65.20 |
| 1/8 | MS-UNet + PAD Module | 75.23 ± 0.11 | 21.45 ± 1.51 | 81.52 | 63.85 | 84.75 | 79.25 | 93.22 | 48.54 | 83.42 | 67.26 |
| 1/16 | SwinUnet | 60.70 ± 0.66 | 50.17 ± 1.99 | 62.82 | 47.01 | 73.66 | 68.42 | 87.68 | 29.56 | 77.62 | 38.86 |
| 1/16 | SwinUnet + ELoss | 63.78 ± 0.23 | 42.11 ± 1.78 | 68.46 | 51.45 | 76.44 | 70.35 | 88.07 | 32.59 | 79.87 | 43.01 |
| 1/16 | SwinUnet + PAD Module | 64.97 ± 0.15 | 40.96 ± 0.73 | 73.15 | 51.00 | 78.49 | 71.99 | 88.10 | 32.52 | 79.86 | 44.66 |
| 1/16 | MS-UNet | 66.25 ± 0.18 | 39.07 ± 2.54 | 77.22 | 53.01 | 78.50 | 71.68 | 91.40 | 33.05 | 81.33 | 46.85 |
| 1/16 | MS-UNet + ELoss | 66.38 ± 0.12 | 38.11 ± 1.38 | 75.27 | 53.19 | 78.30 | 70.83 | 91.12 | 34.12 | 80.74 | 47.48 |
| 1/16 | MS-UNet + PAD Module | 67.31 ± 0.10 | 38.08 ± 2.42 | 77.70 | 54.53 | 78.29 | 70.71 | 92.17 | 33.89 | 80.90 | 50.28 |
Table 5. The average Dice similarity coefficient (DSC) and the average Hausdorff distance (HD) of MS-UNet with the ELoss and PAD modules on the partial ACDC dataset.
| Volume | Methods | DSC | HD | RV | Myo | LV |
|---|---|---|---|---|---|---|
| 1/4 | MS-UNet | 89.14 ± 0.09 | 1.37 ± 0.04 | 86.90 | 86.25 | 94.27 |
| 1/4 | MS-UNet + ELoss | 89.20 ± 0.05 | 1.36 ± 0.03 | 86.79 | 86.37 | 94.44 |
| 1/4 | MS-UNet + PAD Module | 89.52 ± 0.06 | 1.30 ± 0.01 | 87.16 | 86.82 | 94.60 |
| 1/8 | MS-UNet | 87.82 ± 0.10 | 1.52 ± 0.12 | 85.47 | 84.21 | 93.79 |
| 1/8 | MS-UNet + ELoss | 88.07 ± 0.11 | 1.47 ± 0.02 | 85.68 | 84.57 | 93.98 |
| 1/8 | MS-UNet + PAD Module | 88.64 ± 0.12 | 1.43 ± 0.03 | 86.15 | 85.49 | 94.29 |
| 1/16 | MS-UNet | 86.38 ± 0.05 | 2.26 ± 0.63 | 82.78 | 83.14 | 93.20 |
| 1/16 | MS-UNet + ELoss | 86.79 ± 0.04 | 2.07 ± 0.57 | 83.42 | 83.46 | 93.50 |
| 1/16 | MS-UNet + PAD Module | 87.68 ± 0.07 | 1.45 ± 0.03 | 84.57 | 84.64 | 93.85 |
Table 6. The average Dice similarity coefficient (DSC) and the average Hausdorff distance (HD) of MS-UNet with the ELoss and PAD modules on the partial JSRT dataset.
| Volume | Methods | DSC | HD |
|---|---|---|---|
| 1/4 | MS-UNet | 97.4638 ± 0.0530 | 2.9619 ± 0.1043 |
| 1/4 | MS-UNet + ELoss | 97.5351 ± 0.0212 | 2.8503 ± 0.0411 |
| 1/4 | MS-UNet + PAD Module | 97.6761 ± 0.0332 | 2.7110 ± 0.0577 |
| 1/8 | MS-UNet | 96.7722 ± 0.2693 | 4.1073 ± 0.2052 |
| 1/8 | MS-UNet + ELoss | 97.0776 ± 0.0435 | 3.5736 ± 0.2502 |
| 1/8 | MS-UNet + PAD Module | 97.3959 ± 0.0374 | 2.9903 ± 0.0460 |
| 1/16 | MS-UNet | 96.3634 ± 0.0369 | 6.2792 ± 0.2092 |
| 1/16 | MS-UNet + ELoss | 96.4568 ± 0.0528 | 5.6473 ± 0.2803 |
| 1/16 | MS-UNet + PAD Module | 97.0678 ± 0.0267 | 3.8123 ± 0.1369 |
Table 7. The average Dice similarity coefficient (DSC) and the average Hausdorff distance (HD) of each baseline model with the ELoss and PAD modules for the Synapse multi-organ CT dataset.
| Methods | DSC | HD | Aorta | Gallbladder | Kidney (L) | Kidney (R) | Liver | Pancreas | Spleen | Stomach |
|---|---|---|---|---|---|---|---|---|---|---|
| UNet | 77.22 ± 0.41 | 29.31 ± 0.85 | 87.79 | 65.10 | 83.26 | 76.69 | 94.05 | 52.50 | 86.57 | 71.79 |
| UNet + ELoss | 77.73 ± 0.41 | 30.49 ± 1.71 | 88.00 | 64.53 | 83.41 | 76.95 | 93.63 | 54.47 | 86.86 | 73.95 |
| UNet + PAD Module | 78.43 ± 0.09 | 29.26 ± 1.78 | 87.64 | 65.91 | 83.09 | 78.61 | 93.74 | 57.98 | 85.69 | 74.76 |
| TransUNet | 77.07 ± 0.34 | 31.41 ± 1.27 | 87.07 | 61.46 | 81.47 | 75.39 | 94.10 | 55.83 | 85.24 | 77.66 |
| TransUNet + ELoss | 77.67 ± 0.39 | 31.27 ± 1.73 | 86.42 | 63.97 | 81.38 | 76.97 | 94.24 | 56.41 | 85.50 | 76.44 |
| TransUNet + PAD Module | 78.56 ± 0.44 | 31.27 ± 1.42 | 86.92 | 68.55 | 81.54 | 76.55 | 93.16 | 61.13 | 85.71 | 74.95 |
| SwinUnet | 78.45 ± 0.49 | 25.69 ± 1.28 | 85.87 | 66.62 | 82.31 | 78.95 | 93.87 | 56.13 | 89.00 | 74.84 |
| SwinUnet + ELoss | 79.04 ± 0.29 | 24.15 ± 1.23 | 85.64 | 67.23 | 83.41 | 79.32 | 93.85 | 58.26 | 89.39 | 75.20 |
| SwinUnet + PAD Module | 79.22 ± 0.15 | 21.41 ± 1.80 | 85.95 | 68.76 | 84.53 | 80.22 | 93.84 | 55.94 | 90.04 | 74.50 |
Table 8. The best Dice similarity coefficient (DSC) and Hausdorff distance (HD) of different MS-UNet structures with multi-scale nested decoder blocks on the Synapse multi-organ CT dataset.
| Structure | Params | DSC | HD | Aorta | Gallbladder | Kidney (L) | Kidney (R) | Liver | Pancreas | Spleen | Stomach |
|---|---|---|---|---|---|---|---|---|---|---|---|
| None [11] | 27.17 M | 78.45 ± 0.49 | 25.69 ± 1.28 | 85.87 | 66.62 | 82.31 | 78.95 | 93.87 | 56.13 | 89.00 | 74.84 |
| MS-UNet Shallow | 27.49 M | 79.57 ± 0.28 | 23.83 ± 1.33 | 85.67 | 68.20 | 85.05 | 81.38 | 93.89 | 56.29 | 89.99 | 76.10 |
| MS-UNet Deep | 28.75 M | 79.62 ± 0.35 | 22.31 ± 2.00 | 85.01 | 67.20 | 84.77 | 81.34 | 93.59 | 59.96 | 89.34 | 75.19 |
| Standard MS-UNet | 29.06 M | 79.97 ± 0.20 | 21.95 ± 2.11 | 85.33 | 67.89 | 84.86 | 81.24 | 94.30 | 58.24 | 90.14 | 77.76 |
Table 9. The average Dice similarity coefficient (DSC) and the average Hausdorff distance (HD) on the complete Synapse multi-organ CT dataset with different ELoss weights.
| ω | DSC | HD | Aorta | Gallbladder | Kidney (L) | Kidney (R) | Liver | Pancreas | Spleen | Stomach |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 79.97 ± 0.20 | 21.95 ± 2.11 | 85.33 | 67.89 | 84.86 | 81.24 | 94.30 | 58.24 | 90.14 | 77.76 |
| 0.1 | 80.41 ± 0.27 | 21.04 ± 0.72 | 85.02 | 68.18 | 86.29 | 82.30 | 94.23 | 59.95 | 90.08 | 77.00 |
| 0.2 | 79.82 ± 0.35 | 23.08 ± 0.88 | 84.02 | 67.17 | 85.24 | 81.44 | 93.91 | 59.79 | 89.63 | 77.41 |
| 0.5 | 78.22 ± 0.26 | 22.93 ± 2.10 | 81.22 | 65.73 | 84.96 | 80.60 | 93.70 | 56.80 | 87.67 | 75.07 |
Table 10. The average Dice similarity coefficient (DSC) and the average Hausdorff distance (HD) when introducing ELoss at different epochs to the complete Synapse multi-organ CT dataset.
| ELoss Introduced | DSC | HD | Aorta | Gallbladder | Kidney (L) | Kidney (R) | Liver | Pancreas | Spleen | Stomach |
|---|---|---|---|---|---|---|---|---|---|---|
| Without ELoss | 79.97 ± 0.20 | 21.95 ± 2.11 | 85.33 | 67.89 | 84.86 | 81.24 | 94.30 | 58.24 | 90.14 | 77.76 |
| Epoch 0 | 80.07 ± 0.38 | 22.69 ± 1.33 | 85.02 | 68.50 | 85.03 | 81.24 | 94.06 | 59.66 | 90.22 | 76.86 |
| Epoch 50 | 80.41 ± 0.27 | 21.04 ± 0.72 | 85.02 | 68.18 | 86.29 | 82.30 | 94.23 | 59.95 | 90.08 | 77.00 |
| Epoch 100 | 79.86 ± 0.12 | 22.60 ± 1.51 | 85.09 | 67.82 | 84.85 | 81.73 | 94.12 | 58.53 | 90.05 | 76.68 |
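Taken together, Tables 9 and 10 suggest that ELoss joins the base segmentation objective with weight ω and only after a warm-up period. Below is a minimal sketch of such a schedule, assuming the common additive form; `combined_loss`, `omega`, and `eloss_start_epoch` are illustrative names rather than the paper's exact objective, with the best settings found above (ω = 0.1, introduction at epoch 50) as defaults.

```python
# A hedged sketch of the ELoss weighting (Table 9) and warm-up (Table 10).
import torch

def combined_loss(seg_loss: torch.Tensor, eloss: torch.Tensor,
                  epoch: int, omega: float = 0.1,
                  eloss_start_epoch: int = 50) -> torch.Tensor:
    """omega = 0 (or epoch < eloss_start_epoch) recovers plain MS-UNet."""
    if epoch < eloss_start_epoch:
        return seg_loss
    return seg_loss + omega * eloss

# Toy usage with scalar stand-ins for the two loss terms.
seg, edge = torch.tensor(0.42), torch.tensor(0.10)
print(combined_loss(seg, edge, epoch=10))   # warm-up phase: 0.42
print(combined_loss(seg, edge, epoch=60))   # 0.42 + 0.1 * 0.10 = 0.43
```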