Article

MSLUnet: A Medical Image Segmentation Network Incorporating Multi-Scale Semantics and Large Kernel Convolution

School of Physics and Electronic Information Engineering, Henan Polytechnic University, Jiaozuo 454000, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(15), 6765; https://doi.org/10.3390/app14156765
Submission received: 29 May 2024 / Revised: 3 July 2024 / Accepted: 29 July 2024 / Published: 2 August 2024
(This article belongs to the Special Issue Computer Vision for Medical Informatics and Biometrics Applications)

Abstract

In recent years, various deep-learning methodologies have been developed for processing medical images, with Unet and its derivatives proving particularly effective in medical image segmentation. Our primary objective is to enhance the accuracy of these networks while also reducing the number of parameters and computational demands to facilitate deployment on mobile medical devices. To this end, we introduce a novel medical image segmentation network, MSLUnet, which aims to minimize parameter count and computational load without compromising segmentation effectiveness. The network features a U-shaped architecture. In the encoder module, we utilize multiple small convolutional kernels for successive convolutions rather than large ones, allowing for capturing multi-scale feature information at granular levels through varied receptive field scales. In the decoder module, an inverse bottleneck structure with depth-separable convolution employing large kernels is incorporated. This design effectively extracts spatial dimensional information and ensures a comprehensive integration of both shallow and deep features. Additionally, a lightweight three-branch attention mechanism within the skip connections enhances information transfer by capturing global contextual data across spatial and channel dimensions. Experimental evaluations conducted on several publicly available medical image datasets indicate that MSLUnet is more competitive than existing models in terms of efficiency and effectiveness.

1. Introduction

In recent years, significant advancements have been made in computer technology and image processing within the field of medicine. The transition from manual analysis to computer-assisted processing of medical images has become a prevailing trend in healthcare. Medical image segmentation technology extracts features from segmented organs, isolates regions of interest, and analyzes physical and pathological data [1], laying a robust foundation for accurate disease diagnosis. In response to these developments, this paper introduces a novel medical image segmentation network, MSLUnet, which incorporates a symmetric multi-scale feature extraction module and a three-branch attention mechanism. This approach significantly enhances the accuracy and efficiency of automated lesion segmentation.
Traditionally, medical image segmentation relied on techniques such as thresholding, region growing, and edge detection. These methods depended heavily on prior knowledge and manual feature design, which limited their effectiveness on complex datasets. The emergence of deep learning has revolutionized computer vision [2,3] and intelligent algorithms [4,5,6], driving significant progress in automated medical image segmentation. The robust nonlinear learning capabilities of neural networks have fostered substantial advancements in segmentation. Inspired by the Fully Convolutional Network (FCN) [7], Ronneberger et al. developed UNet [8], a fully convolutional neural network tailored for automated medical image segmentation. UNet [8] is renowned for its unique encoder–decoder structure and skip connections and has shown promising results in this field. However, its convolutional encoding block is relatively simplistic and lacks robust feature extraction capabilities. Furthermore, the basic connection mechanism fails to effectively bridge the semantic gap between the encoder and decoder. To address these limitations, enhancements and variations such as Unet++ [9], Unet3+ [10], Attention UNet [11], Res-Unet [12], DoubleU-Net [13], UNeXt [14], and ConvUNeXt [15] have been developed, each contributing novel attributes to the domain of medical image segmentation.
Automatic semantic segmentation of medical images presents challenges due to complex characteristics such as low contrast, fuzzy boundaries, class imbalance, and similar intensities among organs, tissues, and lesions, along with variations in location, shape, and size. Most existing segmentation methods focus on enhancing performance; however, this often results in models that are either too complex or have an excessive number of parameters, making them unsuitable for deployment on most mobile medical devices. In this study, the MSLUnet model introduces innovative multi-scale feature extraction strategies and attention mechanisms for medical image segmentation. This approach optimizes the processing of image details and reduces the number of parameters, thereby providing an efficient and robust solution. The encoder and decoder modules of MSLUnet have been redesigned to enhance feature extraction efficiency. The encoder module incorporates a symmetric multi-scale feature extraction mechanism, primarily utilizing a single 3 × 3 basic convolutional block. This configuration allows for the acquisition of both larger-scale coarse-grained and smaller-scale fine-grained features while minimizing the number of parameters. In the decoder module, deep convolution with large kernels and point-by-point convolution using an inverse bottleneck design replace traditional methods, enhancing the comprehensive extraction of image features. Additionally, a lightweight three-branch attention mechanism introduced in the skip connections aims to more effectively integrate spatial and channel features. This mechanism captures global feature information, effectively reducing the semantic gap between the encoder and decoder.
The main contributions of this work can be summarized as follows:
(1) We introduce a new segmentation network called MSLUnet, specifically designed for lesion segmentation in medical images. MSLUnet efficiently extracts multi-scale and global contextual information while requiring fewer parameters and less computational effort, thereby enhancing the accuracy of lesion segmentation;
(2) We present the Multi-scale Feature Extraction (MSE) block, which captures multi-scale feature information at a granular level. This block utilizes a symmetric structure with multiple basic blocks to comprehensively extract features at different scales;
(3) We propose a convolutional decoder module that combines depth-separable convolution with an inverse bottleneck structure to optimize the extraction of global contextual information and minimize the parameter count. The module is further enhanced with normalization, residual connections, and activation functions to improve its performance in medical image segmentation;
(4) Experimental evaluations demonstrate that our MSLUnet, with only 2.18 million parameters (just 38% of the parameters of the traditional Unet), achieves superior segmentation results compared to other models across various public medical image datasets.
The remainder of this paper is structured as follows: in Section 2, some research advances related to the work of this paper are presented. In Section 3, we elaborate on the specific architecture of MSLUnet. Specific experimental results are provided in Section 4. Section 5 discusses the limitations of the methodology used in this study and future research directions. Finally, Section 6 summarizes the paper.

2. Related Work

Segmentation methods based on deep learning overcome the limitations of traditional manual feature segmentation and highlight the great potential of automatic feature learning. UNet, a deep learning network widely used in the field of medical image segmentation, mainly adopts a U-shaped encoding–decoding structure. This structure preserves the spatial detail information of shallow features by introducing skip connections between the encoder and decoder at each layer, thus achieving accurate segmentation of medical images. Although the UNet model performs well in medical image segmentation, its convolutional block, which uses a fixed kernel size, has a single receptive field and cannot handle images at different scales. Architectures with specific kernel sizes may not be able to extract all relevant features from medical image datasets of various spatial resolutions. Moreover, while skip connections offer preliminary features for the decoder, constraints in the receptive field, biases in convolutional operations, and the integration of semantically diverse features may impact segmentation performance. Based on this, researchers began further investigations into UNet [8], proposing several improvements to address these issues. Res-Unet [12] and Dense-UNet [16] replaced traditional convolutional blocks with Res-blocks from ResNet [2] and Dense-blocks from DenseNet [17], respectively. These modifications were aimed at more efficiently extracting semantic information, thereby enhancing the accuracy of image segmentation. Amyar et al. proposed an innovative multi-task, multi-scale learning framework [18] aimed at predicting patient survival and treatment response. This framework utilizes a multi-task learning approach along with multi-scale feature extraction branches, allowing for the simultaneous extraction of rich information from both intratumoral and peritumoral regions, thereby improving the performance of radiomic analysis. The study utilizes shared representations across related tasks to improve learning efficiency and predictive performance while also providing methodological insights for other medical image analysis applications. Additionally, DoubleU-Net [13] enhances the capture of contextual information by combining two UNet networks and employing an Atrous spatial pyramid pooling (ASPP) module [19] to generate more accurate segmentation masks. It is important to acknowledge, however, that this stacking structure leads to significant increases in parameter count and computational requirements.
Originally, the Transformer model [3] was developed primarily for machine translation tasks, demonstrating exceptional performance by utilizing self-attention mechanisms, layer normalization, feed-forward networks, and residual structures without relying on convolutional layers. The self-attention mechanism enabled the Transformer to excel in a variety of tasks. Given the success of Transformers, the field of image processing adopted the Vision Transformer (ViT) [20], which has garnered significant attention in the academic community due to its impressive capabilities in image recognition tasks. However, Transformers generally lack detailed localization information [21] and require more diverse datasets for training to achieve results comparable to those of CNN-based approaches. Networks such as TransUnet [22] and TransAttUNet [23] employ hybrid CNN-Transformer architectures for medical image segmentation, aiming to enhance feature extraction by combining the inductive biases of CNNs with the global contextual information processing strengths of ViT [20]. Conversely, SwinU-Net [24] represents a purely Transformer-based, UNet-like architecture for medical image segmentation that captures global and long-term semantic information through positional embeddings, hierarchical windowing, and attention mechanisms specific to Swin Transformers. These networks bear similarities to the UNet [8] architecture, particularly in terms of their comparable levels of skip connections, which may limit the effectiveness of feature fusion. Moreover, such approaches typically necessitate extensive medical datasets and substantial computational resources. However, the limited availability of medical data and the constrained computational capabilities of contemporary mobile medical devices collectively hinder the performance of these networks in medical image segmentation tasks.
Low-level semantic feature maps play a crucial role not only in capturing detailed spatial information and delineating object boundaries but also in providing essential positional information about objects. However, during the downsampling and upsampling processes, this nuanced information may gradually diminish, creating a semantic gap between the encoder and decoder feature maps. To address this issue and enhance the fusion of semantic information, Zhou et al. introduced the UNet++ [9] algorithm. This algorithm incorporates a series of nested and densely interconnected networks between the encoder and decoder, significantly enhancing performance in lesion segmentation by merging features at multiple levels. Additionally, Huang et al. developed the UNet3+ [10] algorithm, which captures both fine-grained detail and coarse-grained semantic information across the entire multi-scale spectrum. This is achieved by implementing full-scale skip connections that directly integrate high-level and low-level semantics from feature maps at various scales. While these methods enhance segmentation capabilities through sophisticated skip connections, they also introduce inefficiencies in segmentation due to their complex network structures. Moreover, the computational demand and parameter count of these algorithms are considerably high.
Attention mechanisms can enhance model focus on target regions and minimize the impact of extraneous information, thereby improving segmentation accuracy. The Attention-UNet [11] integrates an attention module within the skip connections of UNet [8], manipulating the significance of features across various spatial locations by generating gating signals that accentuate pertinent features within specific local areas. This integration significantly bolsters the fusion of low-level and high-level semantic features. Jha et al. augmented the efficiency of medical image segmentation by embedding squeeze-and-excitation blocks, ASPP, and attention mechanisms into the Res-Unet++ [25] model. Meanwhile, MCANet [26] tackled the challenge of multi-scale information in conventional ViT-based medical image segmentation by introducing a multi-scale cross-axis attention mechanism, effectively harmonizing information across different scales to enhance segmentation outcomes. The PVT-CASCADE [27] method employs a cascaded attention decoding strategy to elevate segmentation accuracy, which is particularly effective for medical images characterized by complex structures and varying dimensions. This method refines attention allocation, and bolsters feature extraction through a multi-stage process, leading to more precise segmentation outcomes. However, while these networks bolster segmentation precision, their increased complexity necessitates greater computational resources. Such complexity could potentially restrict their applicability in resource-limited settings.
Since most of the existing models are usually large in terms of the number of parameters and computation, significant challenges arise when integrating them into resource-constrained mobile medical devices. Consequently, many researchers are dedicated to achieving lightweight models while ensuring performance retention, thereby enhancing the feasibility of practical applications. Han et al. proposed an efficient convolutional neural network structure optimized for medical image segmentation named ConvUNeXt [15]. This network utilizes enhanced convolutional blocks and finely tuned model depth to improve feature extraction capabilities while reducing computational complexity. By introducing more effective convolution strategies and optimized residual connections, ConvUNeXt achieves dual optimization of parameter efficiency and computational efficiency. UNeXt [14], a rapid medical image segmentation network based on the Multi-Layer Perceptron (MLP), breaks away from the traditional reliance on convolutional architectures. By simplifying the network structure and reducing reliance on convolution operations, this model significantly speeds up processing. Lu et al. developed the LM-Net [28] method, which optimizes medical image segmentation by designing lightweight, multi-branch modules to capture image features. Integrating multi-scale feature extraction strategies with an efficient network design, LM-Net maintains low computational resource consumption while ensuring high segmentation accuracy. Dinh et al. designed U-Lite [29] based on the principles of depthwise separable convolution. They innovatively introduced axial depth convolution modules to expand the model’s receptive field and utilized 3 × 3 axial dilated depth convolution modules to further enhance the model’s segmentation performance. This network not only leverages the strengths of CNNs but also significantly reduces computational complexity.

3. Methods

The architecture of our proposed medical image segmentation model is depicted in Figure 1. The model is structured into three principal components: the encoding phase, the decoding phase, and the skip connections. During the encoding phase, the Multi-scale Feature Extraction (MSE) block is employed to capture global multi-scale features. In the decoding phase, the Large-Kernel Convolution Feature Extraction (LKE) block is utilized to integrate spatial and positional information effectively after sampling. Within the skip connections, a three-branch attention (TA) mechanism is introduced to mitigate the semantic gap between the encoder and decoder, thereby enhancing the integration of feature information.

3.1. Multiscale Feature Extraction Encoder

Medical images exhibit complex and variable shapes and sizes, and the targets that require segmentation vary across image types. Therefore, the segmentation model must demonstrate robustness in handling diverse sizes and types of targets. Drawing inspiration from VGGNet [30], MA-Net [31], and related literature [32,33], we have developed a multi-scale feature module to capture semantic feature information at various levels.
Figure 2 illustrates the multi-scale feature extraction (MSE) encoder module that we have developed. Initially, a 3 × 3 depth-separable convolution is utilized to expand the receptive field and gather more local information. Subsequently, a symmetric convolutional structure is applied to extract multi-scale feature information, ensuring that all sub-feature maps contain richer scale information. Specifically, the feature map F produced by the 3 × 3 depth-separable convolution is divided into m sub-feature maps along the channel dimension. These sub-feature maps are denoted F_i, where i ∈ {1, 2, …, m}. Each sub-feature map is then fed into the corresponding 3 × 3 basic convolutional block for further feature extraction, and the resulting output is denoted X_i. Consequently, the output of the first branch of the symmetric convolutional structure is determined by Equation (1), and the output of the second branch is determined by Equation (2).
Y_i^1 = \begin{cases} X_i, & i = 1 \\ X_i + Y_{i-1}^1, & 1 < i \le m \end{cases}    (1)
Y_i^2 = \begin{cases} X_i, & i = 1 \\ X_i + Y_{i-1}^2, & 1 < i \le m \end{cases}    (2)
The final output Y of the symmetric convolution structure can be expressed as:
Y = Y_i^1 + Y_i^2    (3)
where Y_i^1 represents the output feature map of the first branch of the symmetric convolution structure and Y_i^2 denotes the output feature map of the second branch.
The successive use of multiple small convolutional kernels achieves the same receptive field as a single large kernel [30]. This approach not only reduces the number of parameters but also increases the number of non-linear processing layers, thereby enabling the capture of feature information from a wider range of image neighborhoods. In the symmetric convolutional structure, each 3 × 3 convolutional block extracts feature information from its corresponding sub-feature maps. Simultaneously, a larger receptive field is obtained by using the output of each small branch as part of the input for the neighboring branches. Additionally, diverse multi-scale feature information is captured by concatenating the outputs of all branches along the channel axis. Moreover, a normalization layer [34] and an activation function layer [35] follow each convolutional layer, optimizing both the information flow and processing within the network. To fully integrate the input feature maps with the output from the symmetric convolutional structure, residual connections [2] are utilized. This strategy ensures the capture of rich global multi-scale feature information, effectively bridging potential semantic gaps within the model. The parameter m is set to four to maximize the model's efficiency without compromising its structural integrity or its capability to capture detailed features.
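For concreteness, the following is a minimal PyTorch sketch of an MSE block consistent with the description above. It is not the authors' reference implementation: the interpretation of the second branch as traversing the channel splits in reverse order, and the placement of BatchNorm and GELU after every convolution, are assumptions.

```python
import torch
import torch.nn as nn

class MSEBlock(nn.Module):
    """Sketch of the Multi-scale feature Extraction (MSE) encoder block."""
    def __init__(self, channels, m=4):
        super().__init__()
        assert channels % m == 0, "channels must be divisible by the number of splits m"
        self.m = m
        # 3x3 depth-separable convolution that enlarges the receptive field
        self.dw = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
            nn.Conv2d(channels, channels, 1),
            nn.BatchNorm2d(channels), nn.GELU())
        c = channels // m
        # one 3x3 basic convolutional block per channel split
        self.convs = nn.ModuleList(
            nn.Sequential(nn.Conv2d(c, c, 3, padding=1),
                          nn.BatchNorm2d(c), nn.GELU())
            for _ in range(m))

    def forward(self, x):
        f = self.dw(x)
        splits = torch.chunk(f, self.m, dim=1)
        xi = [conv(s) for conv, s in zip(self.convs, splits)]
        # branch 1: accumulate outputs from the first split onwards (Eq. 1)
        y1, prev = [], None
        for t in xi:
            prev = t if prev is None else t + prev
            y1.append(prev)
        # branch 2: assumed to accumulate from the last split backwards (Eq. 2)
        y2, prev = [None] * self.m, None
        for i in reversed(range(self.m)):
            prev = xi[i] if prev is None else xi[i] + prev
            y2[i] = prev
        # concatenate each branch along channels and add the two branches (Eq. 3)
        y = torch.cat(y1, dim=1) + torch.cat(y2, dim=1)
        return y + x  # residual connection with the block input
```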

3.2. Large-Kernel Convolution Feature Extraction Block (LKE Block)

In the decoder design, we initially chose to employ a large convolutional kernel to extract features from medical images. However, large kernels typically increase the number of parameters. To maintain network efficiency, we employed an inverse bottleneck structure combined with depth-separable convolution [36] for the decoder architecture. Depth-separable convolution involves two principal stages: depth-wise convolution (DW-Conv) and point-wise convolution (PW-Conv). Depth-wise convolution is a form of group convolution where each input channel is convolved with a separate convolution kernel, and the outputs from the different channels are then concatenated together. Point-wise convolution involves convolution using a 1 × 1-sized kernel, enabling the adjustment of channel numbers and linear combinations between channels. Initially, the depth-separable convolution performs depth-wise convolution to extract spatial information by processing each channel independently. This is followed by point-wise convolution, which fuses spatial and channel features. As depicted in Figure 3, our decoder module comprises three sequential components: a 7 × 7 depth-wise convolution followed by two 1 × 1 point-wise convolutions. The initial 7 × 7 layer captures global features from each channel. The subsequent 1 × 1 layer enhances feature representation by expanding the channel dimensions. The final 1 × 1 convolution reduces the channel count while retaining the essential features. Additionally, each convolutional stage in the decoder includes residual connections [2], normalization layers [34], and activation functions [35]. Specifically, a Global Response Normalization (GRN) [37] layer follows the second 1 × 1 convolution to boost feature diversity and enhance the overall quality of the extracted features. Our decoder module can be represented as:
y_1 = σ(BN(DW_Conv_{7×7}(x)))    (4)
y_2 = σ(GRN(PW_Conv_{1×1}(y_1 + x)))    (5)
y_3 = BN(PW_Conv_{1×1}(y_2))    (6)
Out = σ(y_3 + x)    (7)
Here, x represents the input feature map, and y_1, y_2, and y_3 represent the outputs of the three parts of the decoder, respectively. Out represents the final output of the decoder. σ is the GELU activation function, BN stands for the Batch Normalization layer, GRN represents Global Response Normalization, DW-Conv stands for depth-wise convolution, and PW-Conv refers to point-wise convolution.
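A minimal PyTorch sketch of the LKE block, following Equations (4)–(7), is given below. The GRN layer is written in the form popularized by ConvNeXt V2, and the inverse-bottleneck expansion ratio of 4 is an assumption not stated in the paper.

```python
import torch
import torch.nn as nn

class GRN(nn.Module):
    """Global Response Normalization (assumed ConvNeXt V2 formulation)."""
    def __init__(self, channels):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, x):
        gx = torch.norm(x, p=2, dim=(2, 3), keepdim=True)      # per-channel global response
        nx = gx / (gx.mean(dim=1, keepdim=True) + 1e-6)         # divisive normalization
        return self.gamma * (x * nx) + self.beta + x

class LKEBlock(nn.Module):
    """Sketch of the Large-Kernel convolution feature Extraction (LKE) decoder block."""
    def __init__(self, channels, expansion=4):
        super().__init__()
        self.dw = nn.Conv2d(channels, channels, 7, padding=3, groups=channels)
        self.bn1 = nn.BatchNorm2d(channels)
        self.pw1 = nn.Conv2d(channels, channels * expansion, 1)
        self.grn = GRN(channels * expansion)
        self.pw2 = nn.Conv2d(channels * expansion, channels, 1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.act = nn.GELU()

    def forward(self, x):
        y1 = self.act(self.bn1(self.dw(x)))        # Eq. (4): 7x7 depth-wise stage
        y2 = self.act(self.grn(self.pw1(y1 + x)))  # Eq. (5): 1x1 expansion with GRN
        y3 = self.bn2(self.pw2(y2))                # Eq. (6): 1x1 reduction
        return self.act(y3 + x)                    # Eq. (7): residual output
```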

3.3. Three-Branch Attention Mechanism

A significant disparity in semantic understanding exists between the encoder and decoder components within the UNet network. Directly merging low-level semantic feature maps with high-level semantic feature maps can substantially diminish the feature information. To mitigate this issue, a lightweight three-branch attention (TA) mechanism has been integrated into the skip connections. As described in reference [38], this mechanism captures global contextual information across both spatial and channel dimensions. Consequently, it enables the model to prioritize task-relevant features, thus enhancing the efficiency of information transmission and contributing to a more comprehensive representation of semantic and structural details in medical images, ultimately improving segmentation accuracy.
The three-branch attention mechanism processes spatial and channel dimensional information concurrently by capturing their interplay within the input tensor. This method not only boosts feature extraction efficiency but also reduces the risk of information loss. The output from the encoder feeds into this attention mechanism, whose architecture is illustrated in Figure 4. The input feature maps are processed through three distinct branches: the first branch captures interactions between channel C and the spatial dimension W; the second branch focuses on interactions between channel C and the spatial dimension H; and the third branch concentrates on establishing spatial attention. The outputs from these branches are then combined using a basic averaging technique, and the aggregated outputs are merged with the up-sampled feature data to integrate low-level semantic features.
Mathematically, the three-branch attention mechanism can be defined as:
y_1 = σ(BN(Conv(Z-Pool(x_1)))) ⊙ x_1    (8)
y_2 = σ(BN(Conv(Z-Pool(x_2)))) ⊙ x_2    (9)
y_3 = σ(BN(Conv(Z-Pool(x)))) ⊙ x    (10)
at_out = AVG(y_1 + y_2 + y_3)    (11)
Here, the input feature map is x ∈ R^{C×H×W}; in the first two branches, rotation operations convert the feature map tensor into x_1 ∈ R^{H×C×W} and x_2 ∈ R^{W×H×C}. σ represents the sigmoid activation function used to compute the attention weights, BN represents the Batch Normalization layer, and ⊙ represents element-wise multiplication. at_out represents the final output of the attention mechanism.
The Z-Pool layer in the three-branch attention mechanism reduces the 0th dimension of the feature map to 2 by extracting the average-pooled and maximum-pooled features along that dimension and concatenating them. This operation not only reduces the computational load but also preserves the rich information of the tensor. Mathematically, the Z-Pool layer can be represented as:
Z-Pool(x) = [MaxPool_{0d}(x), AvgPool_{0d}(x)]    (12)
where 0d denotes the 0th dimension over which the maximum pooling and average pooling operations are performed.
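The following PyTorch sketch illustrates the three-branch attention and Z-Pool described above. The 7 × 7 convolution kernel inside each branch is an assumption (it is the kernel size used in the original triplet attention design); the paper does not state it.

```python
import torch
import torch.nn as nn

class ZPool(nn.Module):
    """Z-Pool: concatenate max- and average-pooled features along the reduced dimension."""
    def forward(self, x):
        return torch.cat([x.max(dim=1, keepdim=True)[0],
                          x.mean(dim=1, keepdim=True)], dim=1)

class AttentionGate(nn.Module):
    """One branch of the TA mechanism: Z-Pool -> conv -> BN -> sigmoid -> re-weighting."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.zpool = ZPool()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)
        self.bn = nn.BatchNorm2d(1)

    def forward(self, x):
        attn = torch.sigmoid(self.bn(self.conv(self.zpool(x))))
        return x * attn  # element-wise multiplication with the attention map

class TripletAttention(nn.Module):
    """Sketch of the three-branch attention (TA) used in the skip connections."""
    def __init__(self):
        super().__init__()
        self.cw = AttentionGate()  # interaction between channel C and width W
        self.ch = AttentionGate()  # interaction between channel C and height H
        self.hw = AttentionGate()  # spatial attention over H x W

    def forward(self, x):
        # rotate so that the dimension to be reduced becomes the "channel" axis
        x1 = self.cw(x.permute(0, 2, 1, 3)).permute(0, 2, 1, 3)  # (B, H, C, W) branch
        x2 = self.ch(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)  # (B, W, H, C) branch
        x3 = self.hw(x)                                          # (B, C, H, W) branch
        return (x1 + x2 + x3) / 3.0                              # average the three branches
```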

3.4. Downsampling and Upsampling

The downsampling process incorporates both a batch normalization layer and a convolutional layer. We selected a convolutional layer with a stride of 2 and a kernel size of 2 to spatially downsample the image. This method of independent spatial downsampling not only stabilizes the training process but also enhances the network’s segmentation performance.
Similarly, the upsampling process includes a batch normalization layer and a convolutional layer. The feature map is first upsampled 2-fold by bilinear interpolation and then passed through a convolutional layer with a kernel size of 3 × 3, a stride of 1, and a padding of 1. Employing bilinear interpolation allows for an effective enhancement of the feature map resolution while preserving the image's essential features.
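A minimal sketch of the two sampling stages is shown below; placing the batch normalization layer before the convolution is an assumption based on the description above.

```python
import torch.nn as nn

class Downsample(nn.Module):
    """Downsampling stage: BatchNorm followed by a 2x2, stride-2 convolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.BatchNorm2d(in_ch),
            nn.Conv2d(in_ch, out_ch, kernel_size=2, stride=2))

    def forward(self, x):
        return self.block(x)

class Upsample(nn.Module):
    """Upsampling stage: bilinear 2x interpolation, then a 3x3 convolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.BatchNorm2d(in_ch),
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1))

    def forward(self, x):
        return self.block(x)
```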

4. Experiments

4.1. Datasets

Medical image data are a valuable resource, yet accessing medical image datasets is challenging due to patient identity protection and the complexity of data sharing. Fortunately, some researchers and research organizations have made publicly available datasets for the development of medical image segmentation tasks. In our study, we selected three widely recognized biomedical image datasets to validate our model: (1) The BUSI dataset [39] comprises breast ultrasound images of 600 female patients aged between 25 and 75. The dataset consists of 780 images, each with a corresponding truth mask. It features 133 normal breast images, 437 benign tumor images, and 210 malignant tumor images. For our study, we selected 647 images of benign and malignant tumors, which are provided in three-channel color format (commonly referred to as RGB in the context of digital imaging). This color information is crucial as it enhances the visibility of various tissue characteristics that are important for accurate tumor segmentation. (2) Kvasir-SEG [40] is an endoscopic dataset for pixel-level segmentation of colon polyps. It includes 1000 images of gastrointestinal polyps and their corresponding segmentation masks, which have been personally labeled and verified by senior gastroenterologists. (3) The ISIC 2018 [41] dataset is a large-scale dermoscopic image dataset published by the International Skin Imaging Collaboration (ISIC). The mask image data are generated using a variety of techniques and reviewed and curated by professional dermatologists. Due to the limited resources of computer equipment, we resized the input images of the BUSI, Kvasir-SEG, and ISIC 2018 datasets to 256 × 256 to ensure the experiments could be conducted. Additionally, we randomly partitioned the datasets into training and test sets in a 4:1 ratio for experimental evaluation, as shown in Table 1.
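As an illustration of this preprocessing, the following sketch resizes each image/mask pair to 256 × 256 and performs a random 4:1 train/test split; the file naming, directory layout, and fixed random seed are illustrative assumptions.

```python
import random
from glob import glob
from PIL import Image

def load_pair(img_path, mask_path, size=(256, 256)):
    """Load one image/mask pair and resize both to 256 x 256, as done in the experiments."""
    img = Image.open(img_path).convert("RGB").resize(size, Image.BILINEAR)
    mask = Image.open(mask_path).convert("L").resize(size, Image.NEAREST)
    return img, mask

def make_split(image_dir, mask_dir, seed=42):
    """Randomly partition image/mask pairs into training and test sets at a 4:1 ratio."""
    pairs = list(zip(sorted(glob(f"{image_dir}/*.png")),
                     sorted(glob(f"{mask_dir}/*.png"))))
    random.Random(seed).shuffle(pairs)
    cut = int(0.8 * len(pairs))
    return pairs[:cut], pairs[cut:]  # 80% training, 20% test
```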

4.2. Experimental Details

We utilized Python 3.8 as the programming language and PyTorch 1.10.2 as the deep learning framework for our experiments. We employed AdamW [42] as the optimizer and conducted training on an NVIDIA 3080 GPU over 300 epochs, setting the weight decay factor at 5 × 10−5 and the initial learning rate at 4 × 10−4. To evaluate the segmentation performance of various algorithms, we adopted widely recognized metrics such as mean Intersection over Union (mIoU), Dice coefficient, recall, precision, and specificity. These metrics are extensively used in medical image segmentation to objectively assess the accuracy and effectiveness of segmentation algorithms. Furthermore, we employed a combination of cross-entropy loss and Dice loss as the model's loss function. Because some of the public datasets are limited in size, we applied data augmentation techniques, including random resizing, random cropping, horizontal flipping, and vertical flipping, to enhance our dataset.
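For reference, a minimal sketch of the combined loss and the reported optimizer settings is given below; the equal weighting of the two loss terms and the use of binary cross-entropy (since the masks are binary) are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BCEDiceLoss(nn.Module):
    """Combined cross-entropy + Dice loss (equal weighting assumed)."""
    def __init__(self, smooth=1.0):
        super().__init__()
        self.smooth = smooth

    def forward(self, logits, targets):
        bce = F.binary_cross_entropy_with_logits(logits, targets)
        probs = torch.sigmoid(logits)
        inter = (probs * targets).sum(dim=(2, 3))
        union = probs.sum(dim=(2, 3)) + targets.sum(dim=(2, 3))
        dice = (2 * inter + self.smooth) / (union + self.smooth)
        return bce + (1 - dice.mean())

# Optimizer configuration reported in the paper (MSLUnet constructor is hypothetical):
# model = MSLUnet()
# optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4, weight_decay=5e-5)
```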

4.3. Experimental Results

To assess the robustness and effectiveness of the method proposed in this paper, we conducted comparisons with state-of-the-art networks that are commonly used for medical image segmentation. To ensure the fairness and validity of the experimental results, we employed three publicly available datasets for validation: the BUSI dataset, the Kvasir-SEG dataset, and the ISIC 2018 dataset.

4.3.1. Comparison of BUSI Dataset

The performance of our MSLUnet network, compared with other models on the BUSI dataset, is detailed in Table 2. Optimal experimental results are highlighted in bold in the table. As shown in Table 2, despite the limited data in the BUSI dataset, MSLUnet outperforms other segmentation models in metrics such as mIoU, Dice, Recall, and Specificity, achieving optimal results. Moreover, this paper's method demonstrates a 0.7% improvement in the Dice coefficient and a 0.3% improvement in the mean intersection over union (mIoU) compared to the state-of-the-art LM-Net [28].
Further visual comparative analysis is provided in Figure 5, which illustrates the visual segmentation results of various networks on the BUSI dataset. Segmenting breast lesions is particularly challenging due to similar grayscale values, variable tumor morphology, and indistinct boundaries, especially in irregularly shaped malignant tumors. The visual results indicate that most networks perform well with single-shaped breast tumors. However, the UNet network struggles with accurately delineating irregularly shaped lesions, often misclassifying dispersed regions as background. Several other models occasionally omit parts of the lesions; the MSLUnet network consistently demonstrates superior segmentation accuracy compared to all other methods evaluated.

4.3.2. Comparison of Kvasir-SEG Dataset

The experimental results obtained by employing MSLUnet alongside various segmentation models on the Kvasir-SEG dataset are presented in Table 3. The highlighted section of the table indicates the optimal experimental results. An analysis of Table 3 reveals that MSLUnet outperforms other models across three evaluation metrics: Mean Intersection over Union (mIoU), Dice coefficient, and Specificity. Notably, the proposed method increases the Dice coefficient by 0.6%, the mIoU by 1.2%, and the specificity by 0.7%, outperforming the advanced GCASCADE [45] network.
To better illustrate the comparative analysis, a visual comparison of segmentation results on the Kvasir-SEG dataset is depicted in Figure 6. The figure shows that most networks effectively segment small and distinct target regions. However, MSLUnet demonstrates superior capability in accurately delineating targets within endoscopic images of colon polyps, which are known for their intricate shapes, compared to other segmentation networks.

4.3.3. Comparison of ISIC 2018 Dataset

The qualitative experimental results of MSLUnet, compared to other segmentation models on the ISIC 2018 dataset, are outlined in Table 4. The bolded section of the table highlights the optimal experimental results. According to Table 4, MSLUnet outperforms other models in four key metrics: mIoU, Dice, Precision, and Specificity. Moreover, the algorithm introduced in this paper shows improvements of 0.5% in Dice coefficient, 0.8% in mIoU, 0.4% in Precision, and 0.3% in Specificity over the state-of-the-art GCASCADE [45] network.
For a visual enhancement of the comparative analysis, Figure 7 presents the segmentation results of each network on the ISIC 2018 dataset. The visualization indicates that for dermatological images without hair obstruction and with clear distinctions between the target and background, both the MSLUnet and other networks accurately segment the lesion regions. Most current advanced networks, however, struggle with incomplete segmentation when faced with dermatoscopic images that feature complex shapes, non-uniform colors, and intricate edges. In contrast, our MSLUnet network demonstrates superior segmentation capability, achieving more comprehensive segmentation. Nonetheless, it still exhibits some limitations in accurately segmenting edge regions with complex shapes.

4.3.4. Statistical Test Analysis

In an analysis of variance (ANOVA), the F-statistic and the p-value are two key indicators used to assess whether differences between groups are statistically significant. The F-statistic compares the variation between the means of multiple groups to the variation within the groups. In the ANOVA test, the F-statistic is calculated as follows:
F = MSB / MSW    (13)
where MSB (Mean Square Between groups) represents the variance between groups and MSW (Mean Square Within groups) represents the variance within groups. The larger the F-value, the greater the between-group variance relative to the within-group variance, indicating a more pronounced difference in the group means.
The p-value represents the probability of observing the F-statistic, or a more extreme value, if the null hypothesis (i.e., that the means of all groups are equal) is true; p-values thus measure the consistency of the data with the null hypothesis. Smaller p-values imply that the observed data would be less probable if the null hypothesis were true, providing stronger justification for rejecting it; p-values range from 0 to 1. A result is considered statistically significant when the p-value falls below the significance level α (we chose α = 0.01); in that case the null hypothesis is rejected, indicating significant differences between groups.
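As a sketch of how such a test can be computed, the snippet below runs a one-way ANOVA over per-run Dice scores with SciPy; the numbers shown are illustrative placeholders, not the paper's measurements.

```python
from scipy import stats

# Dice scores from four independent runs per model (illustrative values only)
dice_scores = {
    "MSLUnet": [0.912, 0.915, 0.910, 0.913],
    "LM-Net":  [0.905, 0.903, 0.908, 0.904],
    "UNet":    [0.881, 0.879, 0.884, 0.880],
}

# One-way ANOVA: F = MSB / MSW, p = probability of an F this large under the null hypothesis
f_stat, p_value = stats.f_oneway(*dice_scores.values())
print(f"F = {f_stat:.2f}, p = {p_value:.3g}")
significant = p_value < 0.01  # significance level alpha used in the paper
```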
In order to more accurately compare the performance of each model, we tested the differences in the model Dice coefficient using analysis of variance (ANOVA) to determine if there were statistically significant differences in model performance. The purpose of this experiment was to evaluate the performance of various medical image segmentation models in processing the three medical image datasets we utilized and to compare their accuracy and stability. We compared several popular deep-learning segmentation models in Table 2, Table 3 and Table 4. Each model was trained on the same hardware and software configurations, using the same number of training rounds and learning rates. To evaluate the stability of each model, four independent runs were conducted under the same initialization conditions, with each run employing a distinct partition of images for assessment. This methodology is designed to ensure the reliability of the results and to mitigate errors potentially caused by biases in data selection. The final ANOVA analysis results are shown in Table 5.
From the analysis results in Table 5, it can be seen that the F-statistic values of each dataset are very high, and the corresponding p-values are also very small (much smaller than the commonly used significance level of 0.01). This indicates that there are obvious differences in the performance of various models on different datasets. In other words, different models exhibit varying adaptability and effectiveness on different datasets.
- BUSI dataset: The F-statistic is 302.87 with a p-value of 1.78 × 10−29, indicating an extremely significant difference in performance between models on this dataset. This could mean that some models perform very well on this type of data while others perform very poorly, with high variability.
- Kvasir-SEG dataset: The F-statistic is 126.46 with a p-value of 2.39 × 10−23, indicating lower variability in performance among models compared to the BUSI dataset. However, the significance level remains very high.
- ISIC 2018 dataset: The F-statistic is 62.44, and the p-value is 1.62 × 10−18, indicating the smallest difference in performance between the models compared to the previous two datasets, albeit still significant.
In order to make the comparison of results more intuitive, we visualize the results using box plots. From the experimental results, we can see that our proposed MSLUnet demonstrates optimal performance compared to all other models, showing high stability and accuracy.
From Figure 8, we can observe significant differences in the performance of various medical image segmentation models based on the Dice coefficient. Among these, MSLUnet and LM-Net perform optimally, exhibiting the highest median values and a more concentrated distribution. This indicates that these two models have higher consistency and stability in medical image segmentation tasks. MSLUnet slightly outperforms LM-Net, evidenced by a higher upper limit of Dice scores, suggesting that MSLUnet can achieve superior segmentation accuracy in some instances. AAU-Net and SwinUnet also demonstrate commendable performance, with their median and distribution of Dice scores indicating efficient image segmentation capabilities, albeit slightly less so than MSLUnet and LM-Net. ResUnet++ and UNet++ display moderate performance, characterized by median values in the middle range and a relatively wide distribution, which suggests that their performance may fluctuate widely across test samples. UNet and CMUNeXt perform slightly below the middle of the range but remain within acceptable limits. Conversely, Attention_Unet, ULite, and ConvUNeXt exhibit poorer performance.
From the box plots depicted in Figure 9, it is evident that our model, MSLUnet, achieves the highest Dice scores, exhibiting the highest median and a compact distribution across the four trials, which indicates stable and excellent performance. GCASCADE and CMUNeXt also exhibit high Dice scores, especially GCASCADE, whose upper quartile nearly reaches 92%, showing very high segmentation accuracy. LM-Net and AAU-Net also demonstrate outstanding performance, with Dice scores over 90%, albeit slightly lower than those of GCASCADE and CMUNeXt. In contrast, SwinUnet shows relatively poorer performance, with a wide distribution of scores that reach a low of approximately 83%, potentially leading to unstable performance in certain cases. ULite and Attention_Unet perform in the low to mid-range, suggesting potential limitations in handling complex medical image segmentation tasks.
From Figure 10, we can observe the Dice coefficient performance of various models in the medical image segmentation task. The MSLUnet model in this paper demonstrates the highest scores on all Dice coefficients with less volatility. This indicates that MSLUnet demonstrates high consistency and accuracy in medical image segmentation and is the top performer among all the models. The two models, UNeXt and GCASCADE, also achieve high scores close to that of MSLUnet, but the volatility of GCASCADE is slightly higher. The scores of UNeXt range from 89.2 to 90.3, while GCASCADE’s scores range from 89.4 to 90.9. These two models also demonstrate good segmentation results and are suitable for medical image processing that demands higher accuracy. Both the TransUnet and CMUNeXt models consistently deliver high performance. The TransUnet scores range from 87.8 to 88.3, while the CMUNeXt scores range from 88.1 to 88.5. These models may be slightly inferior when dealing with complex medical images, but they are still reliable choices. The UNet++, Attention UNet, and ConvUNeXt models have scores centered in the medium range with moderate volatility. The ULite and ResUnet++ models have relatively low scores and high volatility. Specifically, ResUnet++ has the lowest score of 85.4, indicating that the segmentation performance may not be very stable in certain cases.

4.3.5. Analysis

Experimental evaluations on three publicly available datasets have demonstrated that MSLUnet outperforms other Unet models and their variants in terms of segmentation accuracy. This finding confirms the effectiveness and feasibility of our model’s architecture. The superior segmentation performance of MSLUnet can be attributed to its utilization of a multiscale symmetric convolutional structure, which facilitates the extraction of diverse spatial features across various scales. The integration of a multi-scale feature extraction encoder enables the model to effectively capture both multi-scale features and global contextual information, thereby enhancing the accuracy of segmenting structures within medical images.
Furthermore, the integration of a three-branch attention mechanism within the skip connections ensures that the model concentrates more effectively on the target area by suppressing irrelevant features. The qualitative results, illustrated in Figure 5, Figure 6 and Figure 7, demonstrate MSLUnet’s strong segmentation performance across targets of varying sizes. This proficiency is closely linked to the combination of a multi-scale convolutional block encoder and a large convolutional kernel block decoder. Together, they facilitate the acquisition of comprehensive representations that encompass essential low-level and high-level semantic features for medical image segmentation.
Moreover, we plotted the mean Intersection over Union (mIoU) against the number of parameters and GFLOPs by averaging the segmentation performance across the three datasets. As shown in Figure 11, compared to lightweight segmentation networks such as UNeXt and LM-Net, MSLUnet has a higher number of parameters and consumes more computational resources. However, it significantly outperforms the existing state-of-the-art (SOTA) models in terms of image segmentation accuracy and processing complex medical images. This advantage is primarily attributed to its multi-scale feature extraction and three-branch attention mechanism, which are designed to significantly enhance the model’s capacity to capture subtle features and adapt to structures of varying sizes. Consequently, our model effectively grasps global contextual information and multi-scale features, thereby improving its ability to accurately segment structures in medical images. However, it should be noted that the complex shapes and features of medical images may lead to the loss of deep edge refinement features. This loss can easily result in biased edge segmentation for images with intricate edges. As shown in Figure 7, our segmentation network performs better in accurately segmenting approximate regions compared to other networks. However, it struggles with segmenting complex edges.

4.3.6. Ablation Experiments

To fully evaluate the effectiveness of our proposed MSLUnet, we conducted comprehensive ablation experiments on three datasets to analyze the contribution of each module. In the ablation experiments, we selected the classical UNet network as the baseline network, and Table 6 presents the experimental results of different components on each dataset.
In our ablation studies, we initially incorporated the Multiscale Feature Extraction Encoder (MSE) module into the traditional UNet architecture. Experimental results clearly demonstrate that this integration not only improves segmentation performance but also decreases the number of parameters and computational load. This improvement is primarily due to the MSE block’s use of a continuous 3 × 3 small convolutional kernel, which efficiently captures diverse multi-scale feature information and addresses potential semantic gaps in the model. As a result, it leads to a more effective feature representation and a decrease in redundant parameters.
Subsequently, we introduced the Large-Kernel Convolution Feature Extraction (LKE) module into the baseline network, which significantly improved the segmentation performance. The LKE module efficiently extracts global contextual information by utilizing deeply separable large convolutional kernels, thereby enhancing the network’s feature fusion capabilities.
Moreover, there is a significant semantic gap between shallow and deep features; simple concatenation can result in the loss of some important features. The integration of a lightweight three-branch attention mechanism into the skip connections significantly enhances the capture of medical information by effectively leveraging global contextual information from both shallow feature spaces and channels. This allows the model to more effectively focus on features relevant to the current task, thus enhancing information transfer and accurately capturing medical semantic and structural details.
Overall, our ablation experiments confirm that each module in the proposed MSLUnet architecture significantly contributes to enhanced segmentation performance while maintaining a relatively low number of parameters and reducing computational demand.

5. Discussion

This paper presents a detailed review of the current major frameworks and techniques within the field of medical image segmentation, critically analyzing the limitations of these approaches in handling complex medical images. To address these challenges, we propose an innovative network called MSLUnet. This method efficiently captures semantic features and global contextual information at various scales through symmetric multi-scale feature extraction modules and depth-separable large-core convolutional blocks. Additionally, we introduce a lightweight three-branch attention mechanism that enables simultaneous processing of information across spatial and channel dimensions. This mechanism effectively reduces semantic discrepancies between the encoder and decoder, significantly minimizing information loss. Our MSLUnet model has been validated across multiple publicly available datasets, demonstrating superior performance compared to other state-of-the-art methods.
In a study examining accelerated cardiac T1 mapping, various deep learning architectures were implemented within MyoMapNet [46] to enhance the estimation of T1 from a minimal set of T1-weighted images. MyoMapNet demonstrated that both fully connected and U-Net architectures could achieve remarkable accuracy and quality in T1 estimation. Specifically, U-Net outperformed in terms of precision, showcasing the potential of encoder–decoder networks with skip connections for medical imaging tasks. This aligns with our findings in MSLUnet, where the incorporation of multi-scale features and attention mechanisms similarly boosted segmentation accuracy and detail resolution. Building upon the strengths observed in U-Net from the MyoMapNet study, our MSLUnet incorporates enhanced multi-scale feature extraction modules and a lightweight attention mechanism to further refine the precision and adaptability of segmentation. This approach not only addresses the challenges seen in conventional architectures but also advances the state of medical image segmentation toward achieving higher fidelity.
Despite the notable enhancements in lesion segmentation, the model still exhibits some limitations, particularly in capturing deep semantic details and handling irregularly shaped lesion edges. We acknowledge the potential for further enhancement in the model’s performance. In practical applications, especially when deployed on mobile medical devices, we continue to encounter challenges related to processing power, memory, and energy efficiency. Future work will focus on enhancing the precision of edge segmentation by further optimizing the network architecture and exploring adaptive optimization strategies to improve the model’s adaptability and reliability across different hardware platforms. Research from the literature [18] on multi-task learning indicates that such strategies can facilitate the sharing of underlying feature representations, allowing different tasks to utilize the same convolutional layers. This not only reduces the number of model parameters and computational resource demands but also helps the model capture universal features applicable across multiple tasks, thereby enhancing learning efficiency and generalization capability. We plan to integrate multi-task learning strategies into our future work to refine the model architecture, enabling the execution of multiple segmentation tasks while reducing parameters and simplifying network complexity. This approach will improve the model’s generality and flexibility, allowing it to effectively minimize its complexity and computational demands while maintaining high segmentation accuracy. Furthermore, considering that current experimental validations primarily focus on a limited number of publicly available datasets, we aim to explore more underutilized datasets and validate their performance across a broader range of standard benchmark tests. This strategy will allow us to comprehensively assess the robustness and practicality of MSLUnet.

6. Conclusions

In this work, we introduce a novel medical image segmentation network, MSLUnet, which achieves an optimal balance between segmentation performance and computational efficiency through a refined network design. Our extensive experiments demonstrate that a multi-scale feature extraction module combined with a large convolutional kernel is crucial for effectively processing scarce medical data and achieving high segmentation performance in medical images.
Overall, the MSLUnet architecture we propose offers a promising solution for medical image segmentation tasks, with the potential to significantly enhance both diagnostic accuracy and efficiency.

Author Contributions

Conceptualization, S.Z.; Data curation, S.Z.; Methodology, S.Z.; validation, S.Z.; Supervision, L.C.; Writing—original draft, S.Z.; Writing—review and editing, L.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work is partially supported by the National Natural Science Foundation of China (No. 61601173).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Xu, G.; Feng, C.; Ma, F. Review of Medical Image Segmentation Based on UNet. J. Front. Comput. Sci. Technol. 2023, 17, 1776–1792. [Google Scholar]
  2. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  3. Vaswani, A.; Shazeer, N.M.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  4. Wang, F.; Wang, X.; Sun, S. A Reinforcement Learning Level-Based Particle Swarm Optimization Algorithm for Large-Scale Optimization. Inf. Sci. 2022, 602, 298–312. [Google Scholar] [CrossRef]
  5. Wang, Y.; Cai, S.; Chen, J.; Yin, M. SCCWalk: An Efficient Local Search Algorithm and Its Improvements for Maximum Weight Clique Problem. Artif. Intell. 2020, 280, 103230. [Google Scholar] [CrossRef]
  6. Wang, L.; Pan, Z.; Wang, J. A Review of Reinforcement Learning Based Intelligent Optimization for Manufacturing Scheduling. Complex Syst. Model. Simul. 2021, 1, 257–270. [Google Scholar] [CrossRef]
  7. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  8. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv 2015, arXiv:1505.04597. [Google Scholar]
  9. Zhou, Z.; Rahman Siddiquee, M.M.; Tajbakhsh, N.; Liang, J. UNet++: A Nested U-Net Architecture for Medical Image Segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support; Springer: Cham, Switzerland, 2018; pp. 3–11. [Google Scholar]
  10. Huang, H.; Lin, L.; Tong, R.; Hu, H.; Zhang, Q.; Iwamoto, Y.; Han, X.; Chen, Y.W.; Wu, J. UNet 3+: A Full-Scale Connected UNet for Medical Image Segmentation. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 1055–1059. [Google Scholar]
  11. Oktay, O.; Schlemper, J.; Folgoc, L.L.; Lee, M.J.; Heinrich, M.P.; Misawa, K.; Mori, K.; McDonagh, S.G.; Hammerla, N.Y.; Kainz, B.; et al. Attention U-Net: Learning Where to Look for the Pancreas. arXiv 2018, arXiv:1804.03999. [Google Scholar]
  12. Diakogiannis, F.I.; Waldner, F.; Caccetta, P.; Wu, C. ResUNet-a: A deep learning framework for semantic segmentation of remotely sensed data. ISPRS J. Photogramm. Remote Sens. 2020, 162, 94–114. [Google Scholar] [CrossRef]
  13. Jha, D.; Riegler, M.A.; Johansen, D.; Halvorsen, P.; Johansen, H.D. DoubleU-Net: A Deep Convolutional Neural Network for Medical Image Segmentation. In Proceedings of the 2020 IEEE 33rd International Symposium on Computer-Based Medical Systems (CBMS), Rochester, MN, USA, 28–30 July 2020; pp. 558–564. [Google Scholar]
  14. Valanarasu, J.M.J.; Patel, V.M. UNeXt: MLP-Based Rapid Medical Image Segmentation Network. In Proceedings of the Medical Image Computing and Computer Assisted Intervention—MICCAI 2022, Singapore, 18–22 September 2022; Part VI. pp. 23–33. [Google Scholar]
  15. Han, Z.; Jian, M.; Wang, G.-G. ConvUNeXt: An efficient convolution neural network for medical image segmentation. Knowl.-Based Syst. 2022, 253, 109512. [Google Scholar] [CrossRef]
  16. Zhou, Y.; Chang, H.; Lu, X.; Lu, Y. DenseUNet: Improved image classification method using standard convolution and dense transposed convolution. Knowl.-Based Syst. 2022, 254, 109658. [Google Scholar]
  17. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2261–2269. [Google Scholar]
  18. Amyar, A.; Modzelewski, R.; Vera, P.; Morard, V.; Ruan, S. Multi-task multi-scale learning for outcome prediction in 3D PET images. Comput. Biol. Med. 2022, 151, 106208. [Google Scholar] [CrossRef]
  19. Chen, L.-C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar]
  20. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  21. Zuo, S.; Xiao, Y.; Chang, X.; Wang, X. Vision transformers for dense prediction: A survey. Knowl.-Based Syst. 2022, 253, 109552. [Google Scholar] [CrossRef]
  22. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar]
23. Chen, B.; Liu, Y.; Zhang, Z.; Lu, G.; Kong, A.W.-K. TransAttUnet: Multi-Level Attention-Guided U-Net with Transformer for Medical Image Segmentation. IEEE Trans. Emerg. Top. Comput. Intell. 2024, 8, 55–68. [Google Scholar] [CrossRef]
  24. Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-Unet: Unet-Like Pure Transformer for Medical Image Segmentation. In Computer Vision—ECCV 2022 Workshops; Springer: Cham, Switzerland, 2023; pp. 205–218. [Google Scholar]
  25. Jha, D.; Smedsrud, P.H.; Riegler, M.A.; Johansen, D.; Lange, T.D.; Halvorsen, P.; Johansen, H.D. ResUNet++: An Advanced Architecture for Medical Image Segmentation. In Proceedings of the 2019 IEEE International Symposium on Multimedia (ISM), San Diego, CA, USA, 9–11 December 2019; pp. 225–2255. [Google Scholar]
  26. Shao, H.; Zeng, Q.; Hou, Q.; Yang, J. MCANet: Medical Image Segmentation with Multi-Scale Cross-Axis Attention. arXiv 2023, arXiv:2312.08866. [Google Scholar]
  27. Titoriya, A.K.; Singh, M.P. PVT-CASCADE network on skin cancer dataset. In Proceedings of the 8th International Conference on Computing in Engineering and Technology (ICCET 2023), Hybrid Conference, Patna, India, 14–15 July 2023; pp. 480–486. [Google Scholar]
  28. Lu, Z.; She, C.; Wang, W.; Huang, Q. LM-Net: A light-weight and multi-scale network for medical image segmentation. Comput. Biol. Med. 2024, 168, 107717. [Google Scholar] [CrossRef]
  29. Dinh, B.D.; Nguyen, T.T.; Tran, T.T.; Pham, V.T. 1M parameters are enough? A lightweight CNN-based model for medical image segmentation. In Proceedings of the 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Taipei, Taiwan, 31 October–3 November 2023; pp. 1279–1284. [Google Scholar]
  30. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  31. Zhao, Z.; Liu, Q.; Wang, S. Learning Deep Global Multi-Scale and Local Attention Features for Facial Expression Recognition in the Wild. IEEE Trans. Image Process. 2021, 30, 6544–6556. [Google Scholar] [CrossRef]
  32. Quan, X.P.; Xiang, L.Y.; Ying, L. Medical Image Segmentation Fusing Multi-Scale Semantic and Residual Bottleneck Attention. Comput. Eng. 2023, 49, 162–170. [Google Scholar] [CrossRef]
  33. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A.A. Inception-v4, inception-ResNet and the impact of residual connections on learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; pp. 4278–4284. [Google Scholar]
  34. Ba, J.; Kiros, J.R.; Hinton, G.E. Layer Normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar]
  35. Agarap, A.F. Deep Learning using Rectified Linear Units (ReLU). arXiv 2018, arXiv:1803.08375. [Google Scholar]
  36. Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1800–1807. [Google Scholar]
  37. Woo, S.; Debnath, S.; Hu, R.; Chen, X.; Liu, Z.; Kweon, I.S.; Xie, S. ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 16133–16142. [Google Scholar]
  38. Misra, D.; Nalamada, T.; Arasanipalai, A.U.; Hou, Q. Rotate to Attend: Convolutional Triplet Attention Module. In Proceedings of the 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2021; pp. 3138–3147. [Google Scholar]
  39. Al-Dhabyani, W.; Gomaa, M.; Khaled, H.; Fahmy, A. Dataset of breast ultrasound images. Data Brief 2020, 28, 104863. [Google Scholar] [CrossRef] [PubMed]
  40. Jha, D.; Smedsrud, P.H.; Riegler, M.; Halvorsen, P.; de Lange, T.; Johansen, D.; Johansen, H.D. Kvasir-SEG: A Segmented Polyp Dataset. In Proceedings of the Conference on Multimedia Modeling, Daejeon, Republic of Korea, 5–8 January 2020. [Google Scholar]
  41. Codella, N.C.F.; Rotemberg, V.M.; Tschandl, P.; Celebi, M.E.; Dusza, S.W.; Gutman, D.; Helba, B.; Kalloo, A.; Liopyris, K.; Marchetti, M.A.; et al. Skin Lesion Analysis Toward Melanoma Detection 2018: A Challenge Hosted by the International Skin Imaging Collaboration (ISIC). arXiv 2019, arXiv:1902.03368. [Google Scholar]
42. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  43. Tang, F.; Ding, J.; Wang, L.; Ning, C.Y.; Zhou, S.K. CMUNeXt: An Efficient Medical Image Segmentation Network based on Large Kernel and Skip Fusion. arXiv 2023, arXiv:2308.01239. [Google Scholar]
  44. Chen, G.; Li, L.; Dai, Y.; Zhang, J.; Yap, M.H. AAU-Net: An Adaptive Attention U-Net for Breast Lesions Segmentation in Ultrasound Images. IEEE Trans. Med. Imaging 2023, 42, 1289–1300. [Google Scholar] [CrossRef]
  45. Rahman, M.M.; Marculescu, R. G-CASCADE: Efficient Cascaded Graph Convolutional Decoding for 2D Medical Image Segmentation. In Proceedings of the 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2024; pp. 7713–7722. [Google Scholar]
46. Amyar, A.; Guo, R.; Cai, X.; Assana, S.; Chow, K.; Rodriguez, J.; Yankama, T.; Cirillo, J.; Pierce, P.; Goddu, B.; et al. Impact of deep learning architectures on accelerated cardiac T1 mapping using MyoMapNet. NMR Biomed. 2022, 35, e4794. [Google Scholar] [CrossRef]
Figure 1. The overall structure of the proposed MSLUnet network resembles a “U” shape. The MSE Block functions as the encoder, extracting rich multi-scale fine-grained feature information. The LKE Block serves as the decoder, performing deep feature fusion primarily through large-kernel depth-separable convolution. Additionally, TA is the lightweight three-branch attention mechanism in the skip connections that helps the model concentrate more effectively on the main feature information.
Figure 2. Detailed structure of the multi-scale feature extraction module. This module learns multi-scale features within a basic block by introducing a symmetric structure, which ensures that a subset of features at both the front and back ends contains richer scale information. DW-Conv denotes depth-wise convolution; BN denotes Batch Normalization. The numbers 1, 2, 3, and 4 indicate that the feature map of each branch is divided into four parts along the channel dimension. Circles indicate element-wise addition, arrows indicate the direction of data flow, and connected rectangular boxes indicate concatenation operations.
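For readers who prefer code to diagrams, the following PyTorch sketch illustrates the split-and-merge idea described in Figure 2: the feature map is divided into four channel groups, successive small-kernel depth-wise convolutions are applied with element-wise addition between neighbouring groups, and the groups are concatenated again. It is a minimal illustration of the mechanism under stated assumptions, not the exact MSE Block; the layer ordering, the symmetric second branch, and the residual connection are assumptions.

```python
import torch
import torch.nn as nn


class MultiScaleSplitBlock(nn.Module):
    """Illustrative split-and-merge block: 4-way channel split, successive 3x3
    depth-wise convolutions with element-wise addition, then concatenation."""

    def __init__(self, channels: int, splits: int = 4):
        super().__init__()
        assert channels % splits == 0, "channels must divide evenly into splits"
        self.splits = splits
        group = channels // splits
        # One small-kernel depth-wise convolution per split (the first split is passed through).
        self.dwconvs = nn.ModuleList(
            [nn.Conv2d(group, group, kernel_size=3, padding=1, groups=group)
             for _ in range(splits - 1)]
        )
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        parts = torch.chunk(x, self.splits, dim=1)   # split along channels into 4 parts
        outs = [parts[0]]                            # first part: identity
        prev = parts[0]
        for conv, part in zip(self.dwconvs, parts[1:]):
            prev = conv(part + prev)                 # element-wise add, then small-kernel conv
            outs.append(prev)
        y = torch.cat(outs, dim=1)                   # splice the four branches back together
        return self.act(self.bn(y)) + x              # residual connection (assumed)


if __name__ == "__main__":
    feat = torch.randn(1, 64, 56, 56)
    print(MultiScaleSplitBlock(64)(feat).shape)      # torch.Size([1, 64, 56, 56])
```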
Figure 3. Detailed structure of the large-kernel convolutional feature extraction module, which is the primary component of the decoder. This module uses a depth-wise convolution with a large kernel size to extract the global information from each channel, followed by a residual connection. Two point-wise convolutions are applied after the depth-wise convolution, and an inverted bottleneck design is implemented to comprehensively integrate spatial and channel information.
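The decoder block in Figure 3 can be sketched in a few lines of PyTorch. The kernel size (7), expansion ratio (4), and normalization/activation choices below are assumptions made for illustration; only the large-kernel depth-wise convolution, the two point-wise convolutions forming an inverted bottleneck, and the residual connection follow the caption.

```python
import torch
import torch.nn as nn


class LargeKernelBlock(nn.Module):
    """Illustrative large-kernel decoder block with an inverted bottleneck."""

    def __init__(self, channels: int, kernel_size: int = 7, expansion: int = 4):
        super().__init__()
        # Depth-wise convolution: one large-kernel filter per channel.
        self.dwconv = nn.Conv2d(channels, channels, kernel_size,
                                padding=kernel_size // 2, groups=channels)
        self.norm = nn.BatchNorm2d(channels)
        # Inverted bottleneck: expand the channel dimension, then project back.
        self.pwconv1 = nn.Conv2d(channels, channels * expansion, kernel_size=1)
        self.act = nn.GELU()
        self.pwconv2 = nn.Conv2d(channels * expansion, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.norm(self.dwconv(x))
        y = self.pwconv2(self.act(self.pwconv1(y)))
        return x + y  # residual connection around the block


if __name__ == "__main__":
    feat = torch.randn(1, 64, 56, 56)
    print(LargeKernelBlock(64)(feat).shape)  # torch.Size([1, 64, 56, 56])
```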
Figure 4. Structure of the three-branch attention mechanism module. The topmost branch computes attention weights across the channel dimension C and the spatial dimension W, the middle branch computes them across the channel dimension C and the spatial dimension H, and the bottom branch captures spatial dependencies. In the first two branches, a rotation operation establishes the link between the channel dimension and the spatial dimensions. Finally, the outputs of the three branches are aggregated through an averaging operation.
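A minimal sketch of a three-branch attention module of this kind, following the triplet-attention design of [38], is given below. The Z-pool (concatenated max and mean pooling), the 7 × 7 convolution, and the exact rotation order are assumptions; the caption only fixes the three branches, the rotations linking channel and spatial dimensions, and the final averaging.

```python
import torch
import torch.nn as nn


class ZPool(nn.Module):
    def forward(self, x):
        # Concatenate max- and mean-pooled maps along dim 1.
        return torch.cat([x.max(dim=1, keepdim=True).values,
                          x.mean(dim=1, keepdim=True)], dim=1)


class AttentionGate(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.pool = ZPool()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        return x * torch.sigmoid(self.conv(self.pool(x)))


class TripletAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.cw_gate = AttentionGate()  # interaction between channel C and width W
        self.ch_gate = AttentionGate()  # interaction between channel C and height H
        self.hw_gate = AttentionGate()  # plain spatial (H, W) attention

    def forward(self, x):
        # Branch 1: rotate so H takes the channel position -> attend over (C, W).
        x1 = self.cw_gate(x.permute(0, 2, 1, 3)).permute(0, 2, 1, 3)
        # Branch 2: rotate so W takes the channel position -> attend over (C, H).
        x2 = self.ch_gate(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)
        # Branch 3: no rotation -> spatial dependencies over (H, W).
        x3 = self.hw_gate(x)
        return (x1 + x2 + x3) / 3.0  # aggregate the branches by averaging


if __name__ == "__main__":
    feat = torch.randn(1, 64, 32, 48)
    print(TripletAttention()(feat).shape)  # torch.Size([1, 64, 32, 48])
```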
Figure 5. Qualitative comparison results of U-Net, U-Net++, Attention U-Net, Res-U-Net++, SwinU-Net, ConvUNeXt, CMUNeXt, AAU-Net, LM-Net, and MSLUnet on the BUSI dataset. The blue boxes display the results of our MSLUnet network segmentation, while the red boxes highlight the segmentation regions where MSLUnet outperforms the other models.
Figure 6. The qualitative comparison results of U-Net, U-Net++, Attention U-Net, Res-UNet++, SwinU-Net, ConvUNeXt, CMUNeXt, GCASCADE, LM-Net, and MSLUnet on the Kvasir-SEG dataset. The blue boxes display the results of our designed MSLUnet network segmentation, while the red boxes indicate the segmentation regions where our model outperforms the other models.
Figure 7. Qualitative comparison results of U-Net, U-Net++, Attention U-Net, Res-U-Net++, TransUNet, ConvUNeXt, CMUNeXt, UNeXt, GCASCADE, and MSLUnet on the ISIC 2018 dataset. The blue boxes display the results of our designed MSLUnet network segmentation, while the red boxes indicate the segmentation regions where our model outperforms the other models.
Figure 8. Box plot of Dice coefficient for multiple models on the BUSI dataset. This figure illustrates the distribution of the Dice coefficient obtained from four different trials. We used analysis of variance (ANOVA) to evaluate the statistical significance of the variations among the different models. Boxes indicate the interquartile range (IQR), lines within the boxes indicate medians, and whiskers extend to 1.5 times the IQR.
Figure 9. Box plot of Dice coefficient for multiple models on the Kvasir-SEG dataset. This figure illustrates the distribution of the Dice coefficient obtained from four different trials. We conducted an analysis of variance (ANOVA) to evaluate the statistical significance of the variations among the different models. Boxes indicate the interquartile range (IQR), lines within the boxes indicate medians, and whiskers extend to 1.5 times the IQR.
Figure 10. Box plot of Dice coefficient for multiple models on the ISIC_2018 dataset. This figure illustrates the distribution of the Dice coefficient obtained from four different trials. We conducted an analysis of variance (ANOVA) to evaluate the statistical significance of the variations among the different models. Boxes indicate the interquartile range (IQR), lines within the boxes indicate medians, and whiskers extend to 1.5 times the IQR.
Figure 11. (a) Comparison of the number of parameters of each segmentation model; (b) comparison of the FLOPs of each segmentation model. The blue box indicates the MSLUnet network proposed in this paper.
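The parameter and FLOP counts compared in Figure 11 and Table 6 can be reproduced along the following lines. Parameter counting uses plain PyTorch; the FLOP estimate assumes the third-party thop package (which technically reports multiply-accumulate counts, commonly quoted as FLOPs), and the stand-in model is purely illustrative.

```python
import torch
from thop import profile  # assumption: the thop package is installed (pip install thop)


def count_params_and_flops(model: torch.nn.Module, input_size=(1, 3, 256, 256)):
    # Total number of learnable parameters, reported in millions.
    params = sum(p.numel() for p in model.parameters()) / 1e6
    # thop traces one forward pass to estimate multiply-accumulate operations.
    flops, _ = profile(model, inputs=(torch.randn(*input_size),), verbose=False)
    return params, flops / 1e9  # GFLOPs


if __name__ == "__main__":
    model = torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3, padding=1))  # stand-in model
    p, g = count_params_and_flops(model)
    print(f"Params: {p:.2f} M, FLOPs: {g:.2f} G")
```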
Table 1. Detailed information on the three medical image segmentation datasets used.
Dataset | Train | Test | Input Image
BUSI [39] | 517 | 130 | 256 × 256
Kvasir-SEG [40] | 800 | 200 | 256 × 256
ISIC 2018 [41] | 2075 | 518 | 256 × 256
Table 2. Comparison of experimental results between MSLUnet and other segmentation models on the BUSI dataset.
Architecture | mIoU (%) | Dice (%) | Recall (%) | Precision (%) | Specificity (%)
UNet [8] | 77.1 | 75.2 | 84.2 | 87.7 | 70.4
UNet++ [9] | 78.3 | 75.8 | 83.1 | 91.7 | 70.8
Attention_Unet [11] | 76.6 | 72.8 | 83.5 | 87.7 | 69.9
ResUnet++ [25] | 78.6 | 77.0 | 87.5 | 86.6 | 72.6
SwinUnet [24] | 78.5 | 77.4 | 87.9 | 89.7 | 72.7
ConvUNeXt [15] | 77.5 | 73.2 | 82.8 | 90.6 | 70.1
CMUNeXt [43] | 78.3 | 75.0 | 87.3 | 85.9 | 72.5
AAU-Net [44] | 78.7 | 78.9 | 87.6 | 86.3 | 72.9
ULite [29] | 76.4 | 72.9 | 82.2 | 89.3 | 69.2
LM-Net [28] | 79.0 | 79.1 | 87.9 | 86.5 | 73.2
MSLUnet (ours) | 79.3 | 79.8 | 88.3 | 88.7 | 73.7
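For reference, the metrics reported in Tables 2–4 can be computed from binary prediction and ground-truth masks as sketched below. This is an illustrative NumPy implementation, not the authors' evaluation code; in particular, the foreground IoU shown here would still need to be averaged over classes and images to obtain the reported mIoU.

```python
import numpy as np


def segmentation_metrics(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-7):
    """Compute IoU, Dice, recall, precision, and specificity for binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    return {
        "iou": tp / (tp + fp + fn + eps),
        "dice": 2 * tp / (2 * tp + fp + fn + eps),
        "recall": tp / (tp + fn + eps),          # sensitivity
        "precision": tp / (tp + fp + eps),
        "specificity": tn / (tn + fp + eps),
    }


if __name__ == "__main__":
    pred = np.zeros((4, 4), dtype=int); pred[1:3, 1:3] = 1
    gt = np.zeros((4, 4), dtype=int); gt[1:4, 1:4] = 1
    print(segmentation_metrics(pred, gt))
```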
Table 3. Comparison of experimental results between MSLUnet and other segmentation models on the Kvasir-SEG dataset.
Architecture | mIoU (%) | Dice (%) | Recall (%) | Precision (%) | Specificity (%)
UNet [8] | 87.3 | 86.6 | 94.2 | 94.2 | 86.0
UNet++ [9] | 90.8 | 90.0 | 95.1 | 95.0 | 87.2
Attention_Unet [11] | 88.9 | 89.5 | 95.2 | 95.1 | 87.4
ResUnet++ [25] | 88.3 | 88.0 | 92.7 | 94.5 | 83.3
SwinUnet [24] | 82.5 | 84.6 | 91.8 | 93.2 | 82.8
ConvUNeXt [15] | 90.7 | 89.6 | 94.3 | 95.4 | 86.6
CMUNeXt [43] | 90.9 | 91.2 | 94.9 | 95.3 | 87.0
GCASCADE [45] | 89.9 | 92.6 | 95.2 | 94.6 | 87.7
ULite [29] | 90.4 | 88.4 | 95.2 | 94.4 | 86.8
LM-Net [28] | 89.2 | 91.5 | 95.0 | 95.3 | 87.3
MSLUnet (ours) | 91.1 | 93.2 | 95.6 | 94.9 | 88.4
Table 4. Comparison of experimental results between MSLUnet and other segmentation models on the ISIC 2018 dataset.
Architecture | mIoU (%) | Dice (%) | Recall (%) | Precision (%) | Specificity (%)
UNet [8] | 85.1 | 86.8 | 91.3 | 93.5 | 78.7
UNet++ [9] | 86.6 | 87.3 | 91.9 | 93.3 | 81.1
Attention Unet [11] | 86.0 | 87.1 | 91.1 | 93.7 | 80.1
ResUnet++ [25] | 85.7 | 86.6 | 90.6 | 93.8 | 79.5
TransUnet [22] | 85.9 | 88.3 | 91.8 | 93.8 | 81.0
ConvUNeXt [15] | 86.2 | 87.2 | 92.0 | 92.9 | 80.5
CMUNeXt [43] | 86.9 | 88.5 | 91.3 | 93.4 | 79.9
UNeXt [14] | 85.4 | 90.3 | 93.2 | 93.9 | 80.7
ULite [29] | 86.0 | 87.0 | 90.8 | 94.0 | 80.0
GCASCADE [45] | 86.5 | 90.9 | 93.1 | 94.7 | 80.9
MSLUnet (ours) | 87.3 | 91.4 | 92.9 | 95.1 | 81.2
Table 5. Results of ANOVA analysis of multiple models on different datasets.
Dataset | F-Statistic | p-Value | Conclusion
BUSI [39] | 302.87 | 1.78 × 10⁻²⁹ | Significant difference
Kvasir-SEG [40] | 126.46 | 2.39 × 10⁻²³ | Significant difference
ISIC 2018 [41] | 62.44 | 1.62 × 10⁻¹⁸ | Significant difference
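The one-way ANOVA summarized in Table 5 (and visualized in Figures 8–10) can be reproduced with scipy.stats.f_oneway, which takes the per-trial Dice scores of each model as separate groups and returns the F-statistic and p-value. The scores below are placeholders for illustration, not the paper's data.

```python
from scipy import stats

# Per-trial Dice scores grouped by model (placeholder values).
dice_by_model = {
    "UNet":    [0.73, 0.74, 0.74, 0.75],
    "UNet++":  [0.75, 0.76, 0.76, 0.75],
    "MSLUnet": [0.79, 0.80, 0.80, 0.79],
}

f_stat, p_value = stats.f_oneway(*dice_by_model.values())
print(f"F = {f_stat:.2f}, p = {p_value:.2e}")
# A small p-value (e.g., < 0.05) indicates a statistically significant
# difference among the models' Dice distributions.
```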
Table 6. Results of ablation experiments for each module on three datasets.
Architecture | Parameters (M) | FLOPs (G) | BUSI mIoU (%) | BUSI Dice | Kvasir-SEG mIoU (%) | Kvasir-SEG Dice | ISIC 2018 mIoU (%) | ISIC 2018 Dice
UNet | 4.32 | 35.61 | 77.1 | 0.737 | 87.3 | 0.866 | 85.1 | 0.848
UNet + MSE | 3.86 | 10.55 | 78.2 | 0.779 | 89.9 | 0.897 | 86.9 | 0.896
UNet + LKE | 2.64 | 5.21 | 77.9 | 0.764 | 91.8 | 0.904 | 86.5 | 0.893
UNet + AT | 4.31 | 10.15 | 78.0 | 0.767 | 90.6 | 0.901 | 87.2 | 0.884
MSLUnet (ours) | 2.18 | 5.69 | 79.3 | 0.798 | 91.1 | 0.932 | 87.3 | 0.914
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
