1. Introduction
In recent years, deep learning has contributed to remarkable progress in the field of medical image segmentation, offering substantial support for diagnosis and treatment. However, while existing image segmentation networks offer precise results, their structures are intricate, and their computational complexity is high. This poses challenges for practical applications, particularly when computational resources are constrained. It is thus necessary to find a balance between segmentation accuracy and computational efficiency.
Since interest in convolutional neural networks (CNNs) first surged, fundamental network architectures such as LeNet [1], AlexNet [2], VGGNet [3], GoogLeNet [4], and ResNet [5] have been introduced. CNNs ushered in a revolutionary era in medical image processing and analysis, primarily owing to their ability to handle high-dimensional data and their outstanding performance in image recognition through hierarchical learning [6]. In the domain of medical image segmentation, the U-Net [7] network model, which combines encoder and decoder structures and employs skip connections to merge low-level and high-level features, holds a prominent position; this approach excels at preserving intricate details. V-Net [8], an offspring of U-Net, closely resembles its parent architecture, but the addition of residual operations and 3D convolutional kernels equips it for 3D target segmentation. Attention U-Net [9], another CNN built upon the U-Net architecture, introduced the noteworthy innovation of the Attention Gate (AG) module. This module leverages soft attention and seamlessly integrates attention mechanisms into the skip connections and up-sampling modules of U-Net, thereby achieving spatial attention refinement. The improved U-Net (mU-Net) [10] proposed by Seo et al. emphasizes the critical role of local features in CT image segmentation of the liver and liver tumors by incorporating object-dependent high-level features. This approach effectively enhances the extraction of local features, increases sensitivity to subtle changes, and significantly boosts segmentation performance, further demonstrating the necessity of leveraging local information in medical image analysis.
The Transformer architecture [11] has had a profound impact on the field of natural language processing. This model leverages self-attention mechanisms to capture long-term dependencies in sequential data, and Transformers excel in tasks such as machine translation, language generation, and text categorization. The success of this architecture has given rise to models such as BERT, GPT, and T5 [12,13,14,15]. Vision Transformer (ViT) [16] demonstrated that the Transformer architecture is applicable not only to natural language processing but also to computer vision. In March 2021, Microsoft Research Asia introduced the Swin Transformer, which employs sliding windows and a hierarchical structure. The Swin Transformer became a backbone for machine vision with its core design element, the “shifted window” (a shift of the window partition) executed between two consecutive self-attention layers. This shift operation facilitates interactions between previously independent windows, significantly enhancing the model’s ability to capture complex relationships [17]. SwinUNet [18] then leveraged the multi-scale features of the Swin Transformer to achieve superior segmentation performance while retaining the advantages of the U-Net model, such as skip connections and upsampling modules.
Each architecture has its own strengths and limitations. The Transformer excels at capturing global information and modeling long-range dependencies through its self-attention mechanism; however, this reliance on self-attention comes at the cost of higher computational demands, which limits its applicability to large image sizes or real-time applications. CNNs are more efficient but struggle to accurately model global information, potentially degrading segmentation quality [18,19,20]. Thus, many researchers have opted to combine CNNs with the Transformer architecture, most notably in TransUNet [19] and UTNet [20]. The former encodes labeled image patches from CNN feature maps into an input sequence, extracting global context, and then feeds it into the Transformer; the latter directly integrates self-attention into a CNN to enhance segmentation. While these methods yield accurate results, they still feature a substantial number of parameters and heavy computational demands, thereby limiting their applicability.
This paper leverages the advantages of previous research to propose an innovative deep learning network architecture. Our primary objectives are to reduce network complexity, enhance feature representation, and maintain precision. We first reduce the computational complexity of the network by replacing the traditional attention mechanism with a lightweight shift operation within the ViT architecture. This shift operation retains the spatial structure of image information to a considerable extent while simultaneously reducing the network’s parameter count and computational load. Second, to maximize the utilization of feature information across different scales, we introduce the strategy of full-scale progressive skip connections, effectively fusing multi-scale features. This architecture, while introducing additional contextual information, brings about higher computational complexity. To mitigate this, we also introduce depthwise separable convolution. Experiments conducted on multiple datasets demonstrated the efficacy of the proposed model. Compared with the traditional TransUNet, the optimized model is superior in terms of medical image segmentation tasks and offers markedly lower computational complexity and a reduced parameter count. This network is thus more suitable for resource-constrained scenarios such as mobile devices or edge computing environments. This represents a valuable practical contribution to the field of medical image segmentation.
The remainder of this paper is structured as follows: Section 2 provides an overview of related work, while Section 3 offers a brief introduction to the proposed model. Section 4 delves into network details, and Section 5 outlines the implementation and presents experimental results. The paper is summarized and concluded in Section 6.
4. Compact Deep Learning Model Using ShiftViT Framework and Optimized Skip Connections
This paper proposes an innovative network model based on the TransUNet framework and the pre-trained ResNet50 feature extraction network. In this section, we outline three key structures: ShiftViT, full-scale progressive skip connections, and depthwise separable convolution. The overall structure of the model is explained in Section 4.4.
4.1. Simplification of ViT Structure: Introducing ShiftViT
Suppose there is an input sequence X containing N elements, each with embedding dimension d; the attention weight matrix A computed by the self-attention mechanism then has size N × N, and the complexity of the structure can be expressed as O(N² × H × d). Here, N² reflects the fact that each element must compute attention weights with the other N − 1 elements; H denotes the number of attention heads (i.e., the number of sets of attention weights computed for each element); and d denotes the embedding dimension, which represents the feature dimension of each element. This complexity represents the amount of computation required by the self-attention mechanism to compute the attention weights. For large input sequences, the N² factor causes a rapid increase in computational cost, and a large number of attention heads H increases it further. The overall complexity can thus be written as:

$$\Omega(\mathrm{MSA}) = \mathcal{O}(N^{2} \times H \times d)$$
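For instance, a 224 × 224 input divided into 16 × 16 patches yields N = (224/16)² = 196 tokens; doubling the image side length quadruples N and therefore increases the attention cost sixteen-fold.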
The computational complexity of the multi-head self-attention mechanism in the encoder structure of the ViT (see Figure 1a) is therefore high, and simplifying this module is key to obtaining a lightweight network structure.
Wang et al. [27] pointed out that it is not the attention mechanism itself but the overall framework structure that allows such networks to achieve precise segmentation. Replacing this mechanism with a parameter-free shift operation (it extracts features without learning any weights or biases) therefore simplifies the network while preserving, and even improving, segmentation precision. The improved ShiftBlock structure is shown in Figure 1b and defined as follows:
$$F(x) = s(x) + M\big(N(s(x))\big)$$

where x is the input to the structure, F(x) is the output of the structure, s(x) is the shift operation, N(·) represents the layer normalization operation, and M(·) is the multilayer perceptron (MLP) block.
The shift operation is a local pixel translation that captures the contextual information of an image by establishing relationships between neighboring pixels. The importance of local features in image segmentation tasks has been emphasized in previous studies, which demonstrated that the effective utilization of local information leads to improved segmentation performance [10]. By enhancing the retention of local features, this operation improves the model’s sensitivity to subtle changes and boundaries. Additionally, the shift operation significantly lowers computational complexity, simplifies the model structure to reduce the risk of overfitting, and improves the flow of information between different sub-regions. Its procedure is as follows: (1) select a region of the input feature map, (2) divide it into four equal parts along the channel dimension, and (3) translate these four parts in the left, right, up, and down directions, respectively, while keeping the remaining channels unchanged. After shifting, out-of-range pixels are discarded, and vacated pixels are filled with zeros. With a step size of 1 pixel, the shift operation is defined as follows:
$$y_{b,i,j,c}=\begin{cases} x_{b,i,j+1,c}, & 0 \le c < C/4 \\ x_{b,i,j-1,c}, & C/4 \le c < C/2 \\ x_{b,i+1,j,c}, & C/2 \le c < 3C/4 \\ x_{b,i-1,j,c}, & 3C/4 \le c < C \end{cases}$$

where x represents the input feature, y represents the output feature, b is the batch index, i and j are the row and column indices of the feature map, c is the channel index, and C is the number of channels.
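As a concrete illustration, the following PyTorch sketch implements the shift just defined. The function name shift_features and the step-size handling are our own assumptions; this is a minimal sketch of the four-way channel-group shift with zero filling, not the authors’ released code.

```python
import torch

def shift_features(x: torch.Tensor, step: int = 1) -> torch.Tensor:
    """Shift four equal channel groups of x (B, C, H, W) left/right/up/down.

    Out-of-range pixels are discarded and vacated positions are zero-filled,
    matching the procedure described above.
    """
    out = torch.zeros_like(x)
    c = x.size(1)
    g = c // 4  # size of each channel group
    # Group 0: shift left along the width axis.
    out[:, :g, :, :-step] = x[:, :g, :, step:]
    # Group 1: shift right.
    out[:, g:2*g, :, step:] = x[:, g:2*g, :, :-step]
    # Group 2: shift up along the height axis.
    out[:, 2*g:3*g, :-step, :] = x[:, 2*g:3*g, step:, :]
    # Group 3: shift down.
    out[:, 3*g:4*g, step:, :] = x[:, 3*g:4*g, :-step, :]
    # Any channels beyond the four groups (C not divisible by 4) stay unchanged.
    out[:, 4*g:] = x[:, 4*g:]
    return out
```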
The ShiftBlock structure is composed of three parts: the shift operation, layer normalization, and the MLP block, as shown in Figure 1b. Following feature extraction by the convolutional layers, the feature map is cut into fixed-size patches, and each patch is flattened into a one-dimensional tensor. A linear transformation then maps each patch tensor to the specified embedding dimension (embed_dim), and each tensor is input to the ShiftBlock structure for shifting. The output of the final MLP is the final output of the ShiftViT module.
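A minimal sketch of the ShiftBlock itself, reusing the shift_features helper above; the MLP expansion ratio of 4 and the GELU activation are assumptions borrowed from common Transformer practice rather than details stated in the paper.

```python
import torch
import torch.nn as nn

class ShiftBlock(nn.Module):
    """F(x) = s(x) + M(N(s(x))): shift, then a LayerNorm + MLP residual branch."""

    def __init__(self, dim: int, mlp_ratio: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map of embedded patches.
        s = shift_features(x)                   # parameter-free shift (see above)
        t = s.flatten(2).transpose(1, 2)        # (B, H*W, C) token view
        t = t + self.mlp(self.norm(t))          # LayerNorm + MLP residual branch
        return t.transpose(1, 2).reshape_as(s)  # back to (B, C, H, W)
```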
4.2. Full-Scale Progressive Skip Connections
To optimize segmentation, we developed the full-scale progressive skip connection module, which uses progressive up-sampling to achieve full-scale skip connections.
4.2.1. Progressive Upsampling
In traditional up-sampling methods, low-resolution feature maps are usually restored to the original image size by interpolation operations (e.g., bilinear interpolation). However, restoring the resolution in this way can blur the image, because a single large upsampling step loses information. Progressive upsampling overcomes this problem by dividing the operation into multiple stages that alternate between a convolution and a 2-fold upsampling. This gradually increases the resolution of the feature map while reintroducing detailed information at each stage, helping to preserve the details and contextual information of the image and thereby improving the quality of the generated image.
Suppose we have input feature map X, and we want to upsample it to Y by alternating between convolution and two-fold upsampling as follows:
$$Y = C\big(U(X)\big)$$

where U(X) denotes one two-fold upsampling of X and C(·) denotes a convolution of the upsampled feature map. This process can be repeated n times to achieve the desired up-sampling factor. This gradual increase in the resolution of the feature map is a useful strategy for tasks such as image generation, super-resolution, and image restoration, where high-quality outputs are required.
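A minimal sketch of such a progressive upsampler, assuming bilinear interpolation, 3 × 3 convolutions, and BatchNorm/ReLU between stages (the paper does not specify these details):

```python
import torch.nn as nn

def progressive_upsampler(channels: int, n_stages: int) -> nn.Sequential:
    """Alternate 2x bilinear upsampling with a 3x3 convolution, n_stages times.

    Each stage doubles the resolution (Y = C(U(X))) while the convolution
    reintroduces local detail lost by interpolation.
    """
    layers = []
    for _ in range(n_stages):
        layers += [
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        ]
    return nn.Sequential(*layers)
```

For example, progressive_upsampler(64, 2) restores a 64-channel map to four times its input resolution in two 2-fold stages.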
4.2.2. Full-Scale Skip Connections
This module is the core of the full-scale progressive skip connections. A skip connection is a typical cross-layer information transfer method that allows each layer of the decoder to connect to different layers of the encoder. This connection not only helps the model better integrate features from different scales and layers, thus improving performance, but also mitigates gradient vanishing and gradient explosion while speeding up the training process. With full-scale skip connections, we can introduce feature information from multiple scales simultaneously, thus better preserving image structure and details. We mathematically define the full-scale skip connection as follows:

$$Y = \mathrm{Concat}\big(X,\, X_1,\, X_2,\, \ldots,\, X_N\big)$$

where Y denotes the output feature map after the full-scale skip connection, X denotes the original input feature map, N is the number of layers in the network, and X_i is the feature map from the i-th layer.
To implement this approach, we select the appropriate scale feature maps from the encoder, transform them to the desired size by convolution, up-sampling, or down-sampling, and then fuse the processed feature maps using concat feature fusion to obtain the corresponding decoder layer. The specific transformation and construction of decoder layer D_i are performed as follows:

$$D_i = \mathcal{A}\Big(\mathrm{Concat}\big(C(\mathcal{D}(E_1)),\, \ldots,\, C(\mathcal{D}(E_{i-1})),\; C(E_i),\; C(\mathcal{U}(D_{i+1})),\, \ldots,\, C(\mathcal{U}(D_N))\big)\Big)$$

where i represents the index of a generic layer in the model, N is the total number of layers of the model, E_k and D_k denote the k-th encoder and decoder feature maps, C(·) denotes the convolution operation, 𝒟(·) denotes downsampling, 𝒰(·) denotes progressive upsampling, 𝒜(·) denotes the feature aggregation mechanism implemented by means of convolution, batch normalization, and the ReLU activation function, and Concat(·) denotes feature fusion by concatenation. As can be seen, in the construction of the i-th decoder layer, the encoder layers from the 1st to the (i − 1)-th undergo downsampling and convolution, while the i-th encoder layer is subject only to convolution; layers from the (i + 1)-th to the N-th utilize the previously constructed decoder layers and undergo progressive upsampling and convolution. These processed feature maps are then fused through concatenation, followed by further feature aggregation through convolution, normalization, and activation. This connection strategy allows the decoder to attend to multi-scale information, facilitating the recovery of details while preserving global context, as sketched below.
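The sketch below illustrates one way to realize such a decoder layer in PyTorch. For brevity, it resizes every branch with bilinear interpolation rather than distinguishing max-pooling downsampling from progressive upsampling, and the branch channel width of 64 is an assumption; it is a sketch of the connection pattern, not the authors’ implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FullScaleDecoderLayer(nn.Module):
    """Build decoder layer D_i from all encoder scales and deeper decoder layers.

    Each incoming map (E_1..E_i and D_{i+1}..D_N) is resized to the target
    scale, convolved, concatenated, and aggregated by Conv-BN-ReLU.
    """

    def __init__(self, in_channels_list, out_channels: int, branch_channels: int = 64):
        super().__init__()
        # One 3x3 convolution per incoming branch, mapping it to branch_channels.
        self.branch_convs = nn.ModuleList(
            [nn.Conv2d(c, branch_channels, 3, padding=1) for c in in_channels_list]
        )
        self.aggregate = nn.Sequential(
            nn.Conv2d(branch_channels * len(in_channels_list), out_channels, 3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, features, target_size):
        # features: list of encoder/decoder maps at arbitrary spatial sizes;
        # each is resized to the scale of D_i before its branch convolution.
        branches = [
            conv(F.interpolate(f, size=target_size, mode="bilinear", align_corners=False))
            for conv, f in zip(self.branch_convs, features)
        ]
        return self.aggregate(torch.cat(branches, dim=1))
```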
4.3. Depthwise Separable Convolution
Depthwise separable convolution is a convolutional technique of CNNs that reduces both computational complexity and the number of parameters while maintaining performance. The underlying principle is the decomposition of standard convolution into two steps: depthwise convolution and point-by-point convolution. Depthwise convolution is applied independently to each channel of the input data, each with its own convolution kernel. This step captures spatial features in the input data. In point-by-point convolution, the output of the depthwise convolution is convolved using a 1 × 1 convolution kernel, and the feature maps of each channel are linearly combined to generate the final output feature map.
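A minimal PyTorch sketch of this decomposition (the 3 × 3 kernel size is an assumption; setting groups equal to the input channel count is what makes the first convolution depthwise):

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise convolution followed by a 1x1 point-by-point convolution."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        # Depthwise: one kernel per input channel (groups = in_channels).
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   padding=1, groups=in_channels)
        # Point-by-point: 1x1 convolution linearly combining the channels.
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))
```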
Assume that the size of the input feature map is [H, W, C], that we apply a K × K convolution kernel, and that the size of the output feature map is [H, W, D]. The computational consumption of the standard convolution is then as follows:

$$\Omega_{\mathrm{std}} = H \times W \times K \times K \times C \times D$$

The computational consumptions of the depthwise convolution and the point-by-point convolution are as follows:

$$\Omega_{\mathrm{dw}} = H \times W \times K \times K \times C, \qquad \Omega_{\mathrm{pw}} = H \times W \times C \times D$$

The total computational consumption of the depthwise separable convolution is therefore:

$$\Omega_{\mathrm{dsc}} = H \times W \times C \times (K \times K + D)$$

These equations show that standard convolution involves a large number of computations, while the total consumption of the depthwise separable convolution is lower by a factor of $\frac{\Omega_{\mathrm{dsc}}}{\Omega_{\mathrm{std}}} = \frac{1}{D} + \frac{1}{K^2}$. The numbers of parameters of the two convolutions follow the same trend:

$$P_{\mathrm{std}} = K \times K \times C \times D, \qquad P_{\mathrm{dsc}} = K \times K \times C + C \times D$$
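For example, with a 3 × 3 kernel (K = 3) and C = D = 256 channels, the ratio of the two costs is 1/D + 1/K² = 1/256 + 1/9 ≈ 0.115, so the depthwise separable form requires roughly 8.7 times fewer multiply-accumulate operations than the standard convolution.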
Key features of depthwise separable convolution include parameter sharing, computational efficiency, and channel count maintenance. Parameter sharing reduces the number of parameters in the model, computational efficiency makes it suitable for resource-constrained environments, and channel count maintenance ensures information integrity. Overall, depthwise separable convolution is a powerful technique for convolutional operations that reduces the computational burden and number of parameters while maintaining high performance, making it particularly suitable for resource-constrained environments such as mobile devices and embedded systems.
4.4. Overarching Framework
The proposed network model is a hybrid coding network based on CNN and ShiftViT for medical image segmentation. Its network structure is shown in Figure 2 and is roughly divided into three parts: encoder, decoder, and skip connections. The encoder consists of a ResNet50 CNN and ShiftViT, and the decoder is constructed layer by layer using full-scale progressive skip connections. Depthwise separable convolution is applied throughout to reduce the parameter count and computation.
The encoder proceeds as follows: the input image first passes through a convolutional layer and a pooling layer to obtain a downsampled feature map. Downsampling helps improve computational efficiency and reduces redundancy in feature representation. To mitigate the potential loss of local context information during downsampling, the feature map is then divided into multiple non-overlapping subregions (patches), each corresponding to a vector, which serves as the input to the ShiftBlock, a structure that contains a shift operation and a feed forward network (FFN). The shift operation enables the exchange of information between different subregions without increasing parameters or computation, thereby alleviating the local feature loss caused by downsampling, while the FFN enhances the nonlinear representation. The output of ShiftBlock is reshaped into a feature map, which is used as an input to the decoder of this network.
Notably, the shift operation is designed to be particularly effective for local feature extraction, especially in medical image processing. The shift operation can pass information between neighboring subregions, smoothing out minor errors at the pixel level. Even in the presence of a few pixel shifts in the labels, the shift operation still shows high robustness and maintains good segmentation results. This property enables the model to achieve accurate segmentation results in the face of complex organ boundaries or labeling errors.
The decoder is constructed from multiple decoder layers, each built through full-scale progressive skip connections. These connections compensate for the limitations of the ShiftViT structure in capturing global context and enhance the flow of information between levels, improving the model’s ability to capture global features. They also increase the diversity and richness of features, improve the accuracy and robustness of segmentation, reduce information loss, and improve the resolution and quality of features. In addition, the use of depthwise separable convolution reduces the number of parameters and computations, improving the efficiency and speed of the model. The final output of the decoder passes through a convolutional layer and a softmax activation function to obtain the final segmentation result.
5. Experimental Results and Analysis
We confirmed the effectiveness of the proposed segmentation method as well as its practical value for medical applications using experiments. Our experimental design, selection of datasets, evaluation metrics, experimental setup, comparison tests, and ablation tests are detailed in this section.
5.1. Datasets
Synapse Multi-Organ Segmentation Dataset (Synapse) [31]: This dataset consists of 30 cases with a total of 3779 axial clinical computed tomography (CT) images of the abdomen for medical image segmentation tasks. Each CT volume consists of 85 to 198 slices of 512 × 512 pixels with varying voxel spatial resolutions. The dataset covers the labeling of eight abdominal organs: the aorta, gallbladder, left kidney, right kidney, liver, pancreas, spleen, and stomach. It is divided into 18 training cases and 12 test cases for model training and performance testing.
Automated Cardiac Diagnosis Challenge dataset (ACDC) [32]: This dataset comprises magnetic resonance imaging (MRI) data from different patients for cardiac image segmentation. The MRI scans of each patient include labels for the left ventricle (LV), right ventricle (RV), and myocardium (MYO). These cine-MR images were acquired under breath-hold and contain a series of short-axis slices covering the cardiac region from the base of the left ventricle to the apex, with slice thicknesses ranging from 5 to 8 mm and an in-plane spatial resolution of 0.83 to 1.75 mm²/pixel. Following previous research methods such as TransUNet and SwinUNet, we divided the dataset into 70 training samples, 10 validation samples, and 20 test samples [18,19].
5.2. Evaluation Metrics
The average Dice similarity coefficient (DSC) and the average Hausdorff distance (HD) were used as performance evaluation metrics [18,19,33]. The DSC, also known as the Dice coefficient or F1 score, is used to measure the similarity between two sets and is commonly applied in medical image segmentation. It calculates the ratio of the intersection of two sets to their average size as follows:
$$\mathrm{DSC}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|}$$

where |A ∩ B| denotes the size of the intersection of two sets A and B, |A| denotes the size of set A, and |B| denotes the size of set B. The value of DSC ranges from 0 to 1; the closer the value is to 1, the more similar the two sets are and the more accurate the segmentation result is. A DSC equal to 1 indicates a perfect match with no error. DSC is often used to evaluate the performance of medical image segmentation algorithms, especially when comparing the agreement between automatic segmentation and manual labeling; higher DSC values indicate more accurate segmentation.
The HD is a distance metric used to measure the similarity between two sets and is commonly applied to medical image segmentation. It measures the maximum dissimilarity between two sets, i.e., the maximum distance from a point in one set to the nearest point in the other set. The computation of the HD involves two sets A and B: for each point in set A, find the closest point in set B; for each point in set B, find the closest point in set A; and then compute the maximum of these two closest distances, as follows:
$$\mathrm{HD}(A, B) = \max\Big\{\max_{a \in A}\,\min_{b \in B}\,\|a - b\|,\; \max_{b \in B}\,\min_{a \in A}\,\|a - b\|\Big\}$$

where ||·|| is the distance norm between points of sets A and B. The smaller the value of HD, the more similar the two sets are and the closer the segmentation result is to the ground truth. An HD equal to 0 indicates that the two sets match perfectly with no mismatched points. HD is often used to evaluate the performance of medical image segmentation algorithms, especially when comparing the consistency between automatic segmentation and manual labeling.
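Both metrics can be computed directly from binary masks; the following NumPy/SciPy sketch illustrates the definitions above (SciPy’s directed_hausdorff operates on point coordinates, so the masks are first converted to lists of foreground-pixel coordinates; non-empty masks are assumed):

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dice_coefficient(pred: np.ndarray, target: np.ndarray) -> float:
    """DSC = 2|A ∩ B| / (|A| + |B|) for binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return 2.0 * intersection / (pred.sum() + target.sum())

def hausdorff_distance(pred: np.ndarray, target: np.ndarray) -> float:
    """Symmetric HD: the maximum of the two directed Hausdorff distances,
    computed over the coordinates of the foreground pixels."""
    a = np.argwhere(pred.astype(bool))
    b = np.argwhere(target.astype(bool))
    return max(directed_hausdorff(a, b)[0], directed_hausdorff(b, a)[0])
```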
5.3. Experimental Setup
This experiment was performed on an Ubuntu server with a V100-SXM2-32GB graphics card; for data augmentation, we used simple random augmentation and random flipping. The ResNet-50 model used in the hybrid encoder was pretrained on ImageNet [5]. On the Synapse dataset, the experimental parameters were set as follows: a learning rate of 0.01 and a batch size of 24. For the SGD optimizer, momentum was set to 0.9, weight decay to 0.0001, and the random seed to 1234. We chose SGD because it is effective for training deep learning models, especially for segmentation tasks, and usually provides better stability and generalization than Adam [34] when data are limited. On the ACDC dataset, due to the smaller dataset size, the batch size was reduced to 8 to update the model more frequently, and the weight decay was adjusted to 0.01 to optimize performance; the remaining parameters were unchanged. We applied a hybrid loss function combining cross-entropy and Dice losses, sketched below.
5.4. Comparison Experiment
We conducted comparative experiments on the Synapse and ACDC datasets to evaluate the performances of the proposed network models and mainstream network models.
We present the average Dice similarity coefficient (DSC) and Hausdorff distance (HD) scores of various mainstream models on the Synapse dataset in Table 1. These values reflect the models’ segmentation performance across multiple anatomical structures. By analyzing the per-class DSC values, we can clearly assess the performance of different models on specific anatomical structures, effectively highlighting the strengths and contributions of the proposed model. In our comparative analysis, we explore the structural characteristics of the different network models and their impact on performance on the Synapse dataset. We find that V-Net and DARR, which use convolution-based encoder-decoder architectures, are capable of basic segmentation but perform poorly in capturing complex anatomical structures, with DSC values of 68.81 and 69.77, respectively, well below our model’s 79.46. U-Net and its variants (e.g., R50 U-Net and Att-UNet) enhance feature transfer, but the DSC of U-Net, at 76.85, is still lower than that of our model. TransNorm, which combines the Transformer’s self-attention mechanism with spatial normalization, achieves a DSC of 78.40 but falls short of our model in capturing fine details. MT-UNet, which combines multiple Transformer modules with U-Net, achieves a DSC of 78.59 and likewise fails to outperform our model. Although SwinUNet shows strong overall performance with a DSC of 79.13, ShiftTransUNet performs more prominently in specific categories, especially those that require strong global dependencies, such as the liver and pancreas, achieving high Dice values of 84.07 and 94.83, respectively, which validates its potential for application in medical image segmentation.
On the ACDC dataset, we analyzed the average DSCs of different medical image segmentation models, as well as the per-class DSCs, which are shown in Table 2. As can be seen in the table, R50 U-Net and R50 Att-UNet use the classical encoder-decoder architecture and, despite their good performance (DSC values of 87.55 and 86.75, respectively), are not as good as our model (90.28) at capturing complex anatomical structures. TransUNet utilizes the self-attention mechanism to enhance feature extraction and reaches a DSC of 89.71, but it still does not outperform our model in capturing fine details. SwinUNet, with a DSC of 90, performs well, but our model performs better in several categories, showing an advantage in feature fusion and detail capture. ViT-CUP and R50-ViT-CUP, which are based on the Vision Transformer structure, also remain below our model, although R50-ViT-CUP reaches a DSC of 87.57. In specific segmentation tasks, ShiftTransUNet performs particularly well, with Dice values of 90 and 87.49 for the right ventricle (RV) and myocardium (Myo), respectively, and 93.35 for the left ventricle (LV), showing its effectiveness in segmenting complex structures.
Our analyses of the Synapse and ACDC datasets show that ShiftTransUNet has significant potential for high-precision medical image segmentation across different modalities (CT and MRI). The model not only outperforms many mainstream models in overall performance but also demonstrates strong capabilities across the various segmentation categories. In addition, by introducing the ShiftViT structure and depthwise separable convolution, ShiftTransUNet effectively reduces computational complexity while improving performance, further highlighting its wide applicability in medical image segmentation.
5.5. Analytical Study
The ablation study in this research was conducted on the Synapse dataset, with detailed results presented in Table 3. This experiment aims to evaluate the impact of different structural modifications on model performance.
As shown in the table, the second row indicates that when using TransUNet alone, the DSC value is 77.48, the HD value is 31.69, the computational cost is 24.66 GMac, and the number of parameters is 105.28 M. In the third row, when ShiftViT replaces the ViT structure in TransUNet, the DSC value slightly increases to 78.04, while both the computational cost and the parameter count drop significantly, to 8.94 GMac and 24.95 M, respectively, indicating that ShiftViT effectively reduces computational complexity while maintaining performance. In the fourth row, full-scale progressive skip connections replace the same-layer skip connections of TransUNet, resulting in a significant increase in DSC to 81.06 and a decrease in HD to 28.22. Although the computational cost and parameter count rise to 35.36 GMac and 119.48 M, respectively, the improved segmentation performance validates the effectiveness of this connection method in handling complex anatomical structures. In the fifth row, depthwise separable convolution replaces the convolution operations in the decoder, resulting in a DSC of 78.80, an HD of 32.31, a computational cost of 22.68 GMac, and a parameter count of 101.45 M. This modification has minimal impact on segmentation accuracy but effectively reduces computational complexity, further enhancing the model’s efficiency. Finally, in the sixth row, the combination of these three innovations yields a DSC of 79.46 and an HD of 28.29, with the computational cost and parameter count reduced to 9.46 GMac and 28.97 M, respectively. These results provide strong evidence for the effectiveness of the proposed network model, highlighting the contribution of each innovation to enhancing performance and reducing complexity. In addition, Figure 3 provides a visual comparison of the segmentation results, further demonstrating the effectiveness of the proposed network structure.
6. Conclusions
In this study, we propose a method for medical image segmentation that is both efficient and accurate by combining the ShiftViT structure, full-scale progressive skip connections, and depthwise separable convolution. Experimental results on multiple medical image datasets demonstrate that the proposed method not only achieves excellent segmentation performance but also significantly reduces the number of parameters and the computational complexity of the model.
Although the performance of the proposed method is high, there remains room for improvement. First, we plan to extend our research to the field of 3D image segmentation to cope with more complex medical data, including processing images with complex structures and smaller targets. Second, we plan to continue to explore optimization strategies to further improve the computational efficiency of the model. In addition, we will focus on data processing and preprocessing to improve the quality of the input images and to overcome labeling issues to enhance the feasibility of practical applications.
Notably, our model shows good adaptability in medical imaging tasks with different modalities, demonstrating its potential for application in multimodal medical imaging. In the future, we plan to validate the model on more medical imaging tasks and different datasets to demonstrate its wide range of applications.
In summary, although the proposed method represents a valuable contribution to the goal of balancing the trade-off between performance and complexity, further research is needed to ensure the ongoing development of medical image segmentation methods and their real-world applicability in particular.