Article

Bidirectional Efficient Attention Parallel Network for Segmentation of 3D Medical Imaging

1 School of Mechanical and Equipment Engineering, Hebei University of Engineering, Handan 056038, China
2 Key Laboratory of Intelligent Industrial Equipment Technology of Hebei Province, Hebei University of Engineering, Handan 056038, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(15), 3086; https://doi.org/10.3390/electronics13153086
Submission received: 3 July 2024 / Revised: 30 July 2024 / Accepted: 31 July 2024 / Published: 4 August 2024
(This article belongs to the Section Computer Science & Engineering)

Abstract

Currently, although semi-supervised image segmentation has achieved significant success in many respects, further improvement in segmentation accuracy is necessary for practical applications. Additionally, fewer networks are designed specifically for segmenting 3D images than for 2D images, and their performance is notably inferior. To improve the efficiency of network training, various attention mechanisms have been integrated into network models. However, these networks have not effectively extracted all the useful spatial or channel information. This is especially true for 3D medical images, which contain rich, tightly interconnected spatial and channel information, much of which remains to be explored and utilized. This paper proposes a bidirectional and efficient attention parallel network (BEAP-Net). Specifically, we introduce two modules: Supreme Channel Attention (SCA) and Parallel Spatial Attention (PSA). These modules extract more spatial and channel-specific feature information and utilize it effectively. We combine the principles of consistency training and entropy regularization to enable mutual learning among sub-models. We evaluate the proposed BEAP-Net on two public 3D medical datasets, LA and Pancreas. The network outperforms eight current state-of-the-art algorithms and is better suited to 3D medical images, achieving the new best semi-supervised segmentation performance on the LA database. Ablation studies further validate the effectiveness of each component of the proposed model. Moreover, the proposed SCA and PSA modules can be seamlessly integrated into other 3D medical image segmentation networks to yield significant performance gains.

1. Introduction

Medical image segmentation with high accuracy is desired for many downstream clinical applications. Accurate segmentation of medical images provides rich, visible information for doctors' diagnoses, helping them assess the condition and formulate subsequent treatment plans. Recently, with the booming development of computing power and neural networks, a large number of supervised deep learning networks have emerged, and fully supervised deep learning methods have achieved state-of-the-art results in many segmentation tasks [1,2,3]. However, fully supervised networks require large labeled datasets, which are very difficult to obtain for medical images. Medical images can only be read reliably by specialized doctors, and only they can label images within the medical field they are familiar with. Yet doctors do not annotate the images they examine in routine clinical practice, since they can interpret them directly. Building labeled datasets therefore requires dedicated collaboration with medical professionals, a process that is highly expensive and time-consuming. Moreover, it is often challenging to obtain datasets with a sufficient number of annotations for the required structures.
Unsupervised domain adaptation (UDA) [4,5] can be employed to address this issue. These methods generally utilize self-training strategies for learning. Such approaches primarily segment regions by exploiting the strong similarity of features among regions with identical semantic labels and the dissimilarity among regions with different semantic labels. Accurately defining this similarity and dissimilarity to achieve better recognition and differentiation is crucial. However, although these methods can train without labeled data, the surrogate labels they obtain often do not align well with the target semantic classes. Furthermore, similar structures in different images often exhibit dissimilarities due to various external conditions, which consistently constrains the quality of unsupervised image segmentation and leads to often inadequate accuracy [6]. Recent advances in representation learning have improved segmentation performance only marginally, and widespread acceptance and application remain distant goals.
A very feasible direction for overcoming this challenge is image segmentation based on semi-supervised learning. Such methods require only a small amount of labeled data together with a large amount of unlabeled data to train a network model whose accuracy is only slightly lower than that of fully supervised learning [7,8]. This approach typically begins with fully supervised training on a small amount of labeled data to generate a model. This model is then used to produce pseudo-labels for a large amount of unlabeled data. Subsequently, these pseudo-labels are trained iteratively alongside the genuine labels, with the pseudo-labels and model parameters continuously updated to enhance performance throughout the iterations. This methodology is becoming increasingly favored because it requires fewer labels and makes effective use of large unlabeled datasets. In recent years, various semi-supervised image segmentation networks have emerged in numerous applications, ranging from natural scene images to biomedical image analysis [9,10,11] and various other industries. Owing to these advantages, it is gradually becoming a trend in future development.
Despite the significant successes of semi-supervised learning in many fields in recent years, this does not imply that the segmentation problem in medical imaging has been perfectly solved. Current semi-supervised 3D medical image segmentation networks like MC-Net [12] can achieve around 80% Dice, but there is still a gap before they can be applied in clinical practice. Learning discriminative features from a small amount of annotated data and achieving accurate image segmentation remains a challenging task. This motivates us to construct an end-to-end joint training framework suitable for 3D medical image segmentation tasks.
This paper proposes a semi-supervised segmentation network, BEAP-Net, which is specifically tailored for 3D medical images. This network seamlessly integrates the SCA and PSA modules designed in this study, combined with mutual consistency training. It enables extraction and effective utilization of richer spatial and channel-related information, thereby achieving superior segmentation performance for 3D medical images.
Our contributions are as follows:
  • A novel Supreme Channel Attention (SCA) mechanism has been designed. By integrating channel information from two types of pooling operations, SCA maximizes the extraction of channel features. It also addresses the issue of errors introduced by spatial gaps between slices in previous attention networks for 3D medical images. This results in a significant performance enhancement for the network.
  • A Parallel Spatial Attention (PSA) mechanism is proposed to collaborate synergistically with Supreme Channel Attention (SCA). In particular, we run max pooling and average pooling in parallel, concatenate their results, and add the original input block to the resulting feature block. This extracts more detailed spatial information, which assists the network in making more accurate segmentation predictions.
  • The proposed BEAP-Net adopts a configuration with one encoder and three slightly different decoders, constrained by consistency loss to ensure consistent outputs for a given input. This paper designs a cyclic pseudo-labeling scheme that utilizes prediction outputs from three different decoders. These predicted biases are then transformed into auxiliary supervision signals to facilitate model training. We aim for these signals to learn more information by leveraging their differences, thereby achieving performance enhancement.
  • This paper conducts detailed ablation studies on various improvements of the network to better understand their impact on network performance.
  • BEAP-Net is evaluated on datasets from two different domains and compared with several state-of-the-art semi-supervised methods. The experiments demonstrate that the proposed approach achieves superior performance with good generalization and robustness.

2. Related Work

Semi-supervised learning-based image segmentation methods train primarily on a large amount of unlabeled data, enabling them to extract more useful information alongside the annotated examples. Mainstream semi-supervised learning methods include pseudo-labeling [13,14], consistency regularization [15], entropy minimization [16,17], data augmentation, and others. Methods based on pseudo-labeling typically start by training on a small amount of labeled data in a fully supervised manner. The predictions of the trained model are then used as pseudo-labels for the unlabeled data, and these pseudo-labels subsequently serve as a proxy ground truth for training on a large amount of unlabeled data. Zhou et al.'s self-training [18] is a typical example. Pseudo-labels are updated every few training epochs because their quality is closely related to the segmentation performance of the network. Many works also use uncertainty-guided refinement [19], random propagation [20], and other methods to optimize and adjust the generated pseudo-labels. Bai et al. [9] refined the pseudo-labels of images without ground truth labels through a Conditional Random Field (CRF) and then used the refined pseudo-labels to update the network. However, this method is now used less and less. One reason is that newer algorithms already perform well on their own and gain no obvious improvement from adding a CRF; it is not worth introducing extra hyperparameters and computational complexity for barely noticeable gains. Another important reason is that the CRF is decoupled from CNN training; the two are treated separately. Methods based on consistency regularization expect the network's output to remain consistent across different versions of the same image under different perturbations or data augmentations [21,22]. This is very advantageous and important for the future processing of images of the same tissue from different machines. A classic example is [23], where unlabeled samples undergo "weak augmentation" and "strong augmentation" to obtain predicted labels from the model, and a cross-entropy loss constrains the model to produce consistent outputs under the different perturbations. The authors confirmed that it is best to use high-confidence unlabeled data for training. Entropy minimization is also a very popular approach. Lee et al. [24] use pseudo-labeling to encourage low-density separation between classes as a form of entropy regularization during training, enhancing the model's ability to discriminate among categories and thereby improving its performance.
Another widely used image segmentation method is the mean teacher model [25]. This method generally requires a student network and a teacher network. Training first proceeds in a fully supervised manner to obtain a model whose predictions provide pseudo-labels, and then continues within the teacher–student framework. The teacher model assesses the quality of the pseudo-labels, while the student network is trained on labeled and unlabeled data together with the corresponding labels and pseudo-labels. Recently, many improved versions of this approach have been developed and applied in various fields of semi-supervised learning. Xie et al. [26] applied a consistency loss under random augmentation and dropout, continuously replacing the parameters and pseudo-labels of the teacher model with the new parameters and predictions of the student model, thus iteratively refining the pseudo-labels. Wang et al. [27] proposed a mean teacher network model guided by triple uncertainty. They designed two auxiliary tasks, reconstruction and predicting label distances, to help the model learn more features and achieve better prediction accuracy.
In addition to the architecture of the model, attention is also very important. A large body of previous literature [28,29,30] has shown that attention mechanisms can effectively improve the predictive performance of a model. Their core idea is to make the network pay more attention to the feature information most useful for the current task, so that learning becomes more purposeful. Many different attention mechanisms have been developed and widely applied across various industries. Hu et al. proposed a channel attention network called SENet [31], which adaptively weights each channel to emphasize or suppress different channel features according to the requirements of the task. The Efficient Channel Attention (ECA) mechanism proposed by Wang et al. [32] follows basically the same principle as SENet but avoids the negative impact of dimensionality reduction, thus achieving better performance. Woo et al. [33] proposed the Convolutional Block Attention Module (CBAM), which concatenates attention modules along both the spatial and channel dimensions. It uses convolutional operations to blend and extract information across the channel and spatial directions, and this architecture indeed yields significant improvements for many networks.
Within each of the aforementioned approaches, many new methods have emerged, and many researchers combine two or more of the better-performing methods to address various classification or segmentation tasks [23,34]. Additionally, as networks become increasingly deep, the extracted feature information becomes richer. However, most existing semi-supervised segmentation methods struggle to clearly identify the features they need to learn, resorting to highly inefficient iterative learning processes. They also overlook a significant amount of spatial and channel-wise feature information, which limits their performance. While many researchers have previously studied and applied spatial and channel-wise information with decent results, we believe that not all of this information has been fully utilized; abundant spatial and channel-wise information still waits to be explored. The 3D medical datasets considered here are particularly rich in spatial and channel information, and the relationships among this information are tightly interconnected. We therefore aim to extract more information along the spatial and channel directions to better improve the segmentation performance of the model.

3. Methods

Figure 1 illustrates the structure of the proposed BEAP-Net, which consists of a shared encoder and three slightly different decoders. The encoder incorporates our Supreme Channel Attention mechanism and Parallel Spatial Attention mechanism. We aim for the proposed network to extract as much useful information as possible in both the spatial and channel domains. Meanwhile, in the decoders, we encourage the three variants to produce identical outputs for the same input. The predictions of the three decoders are constrained by a consistency loss, promoting them to learn additional informative features from their differences with each other.

3.1. Supreme Channel Attention

The channel attention mechanism has been widely proven to effectively enhance the performance of deep convolutional neural networks (CNNs) and may also hold significant untapped potential. Some studies [35,36,37,38] aim to obtain more channel-related feature information by seeking more complex correlations between channels. However, this generally requires the addition of a large number of new parameters, inevitably increasing the complexity of the model. Balancing the cost of computation with the performance of the model is an issue we must consider. The ECA-Net is a very good method, and its structural diagram is shown in Figure 2. It first performs global average pooling on the input X to obtain aggregated features. Then, it conducts fast 1D convolutions on k adjacent channels to obtain the weight for each channel. Here, k is adaptively determined through a formula designed by the authors, as shown below:
$$k = \left| \frac{\log_2(C)}{\gamma} + \frac{b}{\gamma} \right|_{odd}$$
The authors set γ and b to 2 and 1, respectively, in their experiments, where $|t|_{odd}$ denotes the nearest odd number to t. After the obtained channel weights are activated with a sigmoid (σ) function, they are multiplied by the original X, weighting the normalized weights onto the features of each channel. Consequently, the features in different channels no longer carry equal weight, allowing the network to pay more attention to channel features that are useful for the current task and to suppress those that are not. The core idea of this network is to learn the weights of the feature channels via a loss-driven global average pooling operation. This method adds only minimal parameters and computational load, yet its performance improvement is significant [32]. It must be noted that the ECA module is an almost perfect method for 2D images and has been widely applied in 2D image segmentation and prediction research. However, when its dimensionality is modified to handle 3D medical images and it is applied in 3D medical image segmentation experiments, performance does not improve. This is because the ECA module has a significant drawback for 3D images. The module adaptively generates convolution kernels based on the built-in formula above, and the calculated optimal kernel size is 3. This is not a problem for 2D images, as the three covered channels lie on the same image. However, for the 3D images considered here, each image has size H × W × D = 112 × 112 × 80; in other words, each 3D image consists of 80 2D slices, and these slices are not completely contiguous. The left atrial dataset used here divides a heart into 80 evenly spaced slices with a spacing of about 3 mm, so there is a physical gap between every two adjacent slices. Since the module is a channel attention mechanism with a convolution kernel size k of 3, the three channels it covers are not necessarily on the same slice, and these spatial gaps introduce errors.
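For concreteness, the adaptive kernel-size rule above can be computed in a few lines. The following sketch follows the common implementation of this rule; the function name is ours for illustration.

```python
import math

def eca_kernel_size(channels: int, gamma: int = 2, b: int = 1) -> int:
    """k = |log2(C)/gamma + b/gamma|_odd, snapped to the nearest odd integer."""
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 == 1 else t + 1

print(eca_kernel_size(16))  # -> 3, the kernel size discussed above
```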
The schematic diagram of the proposed SCA module is shown in Figure 3. We fix the convolution kernel size k to 1. By processing slice by slice, we eliminate these spatial gaps and thereby avoid the errors they introduce. The results are promising, showing a significant improvement. While numerous studies have demonstrated the benefits of appropriate inter-channel interaction learning for training, it is more effective for 2D images than for 3D images. High-dimensional channels exhibit longer-range interactions, whereas low-dimensional channels have shorter-range interactions. For our 3D images, introducing extensive spatial gaps for minimal benefit leads to significant errors, potentially yielding results no better than omitting these measures, while unnecessarily increasing computational complexity. Since the images are 3D, direct application is impractical. Therefore, we squeeze the dimensions of X with average pooling, collapsing the last two dimensions to 1. Note that this operation is not dimensionality reduction, which would disrupt the correspondence between channels and their weights and thereby degrade model performance. It is a temporary measure for convenience in subsequent processing stages; the dimensions are restored later without adverse effects, as this adjustment is solely for channel processing.
Additionally, the SCA module no longer relies on global average pooling alone to obtain weights. Our experiments have shown that adaptive average pooling often yields better results. We believe that average pooling captures one portion of the informative features, while max pooling captures another useful portion. Therefore, we adopt a parallel processing approach, applying adaptive average pooling on one path and adaptive max pooling on the other. After fast convolutions with k = 1, we concatenate the results from both paths along the channel dimension, followed by a 3D convolution operation. This merges the information extracted from both sides, allowing the network to capture more potential channel features. Subsequently, the weights are activated using a sigmoid function and multiplied with the original feature map X. As a result, the module focuses more on the channel features that are useful for the current task while suppressing those that are less relevant.
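To make the data flow concrete, the following PyTorch sketch mirrors the description above: parallel adaptive average and max pooling, fast convolutions with k = 1, channel-wise concatenation, a fusing 3D convolution, and sigmoid gating. It is a minimal sketch under our own naming; layer details are assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class SCABlock(nn.Module):
    """Supreme Channel Attention sketch: per-channel weights from parallel
    avg/max pooling, fused and applied back onto the input volume."""
    def __init__(self, channels: int):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool3d(1)   # squeeze spatial dims to 1x1x1
        self.max_pool = nn.AdaptiveMaxPool3d(1)
        # k = 1 "fast" 1D convolutions: per-channel, no kernel spanning slice gaps
        self.conv_avg = nn.Conv1d(1, 1, kernel_size=1, bias=False)
        self.conv_max = nn.Conv1d(1, 1, kernel_size=1, bias=False)
        # fuse the two pooled descriptors concatenated along the channel axis
        self.fuse = nn.Conv3d(2 * channels, channels, kernel_size=1, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                          # x: [B, C, H, W, D]
        b, c = x.shape[:2]
        a = self.avg_pool(x).view(b, 1, c)         # [B, 1, C] channel descriptor
        m = self.max_pool(x).view(b, 1, c)
        a = self.conv_avg(a).view(b, c, 1, 1, 1)   # restore [B, C, 1, 1, 1]
        m = self.conv_max(m).view(b, c, 1, 1, 1)
        w = self.fuse(torch.cat([a, m], dim=1))    # [B, 2C, ...] -> [B, C, ...]
        return x * self.sigmoid(w)                 # reweight each channel
```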
The SCABlock is stacked with Conv3d to form the SCAConvBlock as illustrated in Figure 4. This is a convolutional module with a channel attention mechanism, equivalent to the SCA block in Figure 1. We replace the traditional convolutional blocks with this module and run the network with this module in parallel with a standard 3D convolutional network pathway. This allows our network to extract more channel-specific information and dynamically weigh each channel to emphasize or suppress different feature channels according to the task requirements.

3.2. Parallel Spatial Attention

Because the Supreme Channel Attention mechanism has certain limitations in capturing spatial relationships, we designed the Parallel Spatial Attention (PSA) module to compensate for this shortcoming. It generates spatial attention feature maps by analyzing the internal relationships within the feature maps. Unlike channel attention, spatial attention focuses on "where" the effective information lies on the feature map. Previous spatial attention mechanisms typically chose either average pooling or max pooling to obtain weights. We observe, however, that average pooling extracts the mean value within each pooling region, yielding a spatially averaged information feature, while max pooling extracts the maximum value within each region, yielding a spatially maximal information feature. These two are not completely decoupled, and there is no need to choose one over the other. Features extracted by max pooling are as meaningful as those from average pooling, and we believe the two types of features are complementary in certain respects. Extracting both kinds of pooled feature information simultaneously may enable the network to better understand the spatial information of different regions, leading to more accurate predictions; the subsequent ablation studies confirm this. The proposed PSA module is illustrated in Figure 5. It performs separate average pooling and max pooling operations along the channel axis on the input feature blocks. It has been demonstrated [39] that pooling operations along the channel axis help capture specific information within regions. Afterwards, the pooled feature maps are concatenated along the channel dimension. We then apply a convolution with a 7 × 7 kernel to fuse the information, changing the shape of the feature map from [b, 2, h, w, d] to [b, 1, h, w, d]. Finally, we normalize the spatial weights of the convolved feature map using the sigmoid function and multiply this new weight with the input feature block, producing new feature blocks with spatial attention. This process can be represented by the following equation:
$$M_s(F) = \sigma\left(f^{7 \times 7}\left(\left[\mathrm{AvgPool}(F);\ \mathrm{MaxPool}(F)\right]\right)\right) = \sigma\left(f^{7 \times 7}\left(\left[F^{s}_{avg};\ F^{s}_{max}\right]\right)\right)$$
However, through multiple experiments, we found that using such feature blocks alone not only provides no performance gain but may actually degrade the network's effectiveness. The feature block that integrates the two types of weights does pay greater attention to certain more useful parts, but it diverges significantly from the original input feature block. Rather than adding attention, this imposes a constraint and may lose many of the original informational features. Therefore, we add the original input X to the final block that carries the spatial attention. The resulting block not only incorporates spatial attention from both pooling paths but also retains the original feature information. Experiments have demonstrated that this approach indeed yields better results.
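A minimal PyTorch sketch of this module follows, assuming the 7 × 7 kernel extends to 7 × 7 × 7 in 3D (with padding 3 to preserve shape); names are ours for illustration.

```python
import torch
import torch.nn as nn

class PSABlock(nn.Module):
    """Parallel Spatial Attention sketch: channel-axis average and max
    pooling in parallel, concatenation, a fusing convolution, sigmoid
    gating, and a residual add of the original input."""
    def __init__(self):
        super().__init__()
        # 2 pooled maps in, 1 attention map out; padding keeps h, w, d unchanged
        self.conv = nn.Conv3d(2, 1, kernel_size=7, padding=3, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                            # x: [b, c, h, w, d]
        avg = torch.mean(x, dim=1, keepdim=True)     # [b, 1, h, w, d]
        mx, _ = torch.max(x, dim=1, keepdim=True)    # [b, 1, h, w, d]
        attn = self.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * attn + x  # residual add retains the original features
```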
Note that the PSA and SCA modules are not connected in series. Although theoretically sound, experiments showed that processing the feature blocks successively through these two modules yields unsatisfactory results. We believe the feature information the two modules focus on is partly similar and partly different, and when combined in series the information becomes disordered. Therefore, this paper uses a dual-path parallel arrangement. One path passes the input block through convolution operations incorporating the SCA mechanism, followed by downsampling, to obtain feature blocks with channel-wise attention. The other path passes the input block through regular convolution and downsampling operations, followed by the PSA module, to obtain feature blocks with spatial-wise attention. Afterward, we concatenate these two types of feature blocks along the channel dimension and apply another 3D convolution, obtaining feature blocks that incorporate attention from both aspects without altering their shapes. This entire process constitutes the first layer of BEAP-Net; a sketch of the arrangement is given below. In the second, third, and fourth layers of the network, only the PSA path is used. The rationale behind this decision is discussed in Section 5. Note that in the parallel paths of the first layer, we use the feature blocks obtained through regular 3D convolution for the skip connections, rather than the feature blocks processed through the SCA module.
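For illustration, the following sketch wires the first encoder layer as just described, reusing the SCABlock and PSABlock sketches above. Channel sizes, the downsampling operator, and the exact skip-connection hook are assumptions, not the authors' precise implementation.

```python
import torch
import torch.nn as nn

class BEAPFirstLayer(nn.Module):
    """First-layer sketch: an SCA-convolution path and a regular
    convolution + PSA path run in parallel; their outputs are
    concatenated along the channel axis and fused by a 3D convolution."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.sca_conv = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, 3, padding=1),
            SCABlock(out_ch))                                 # sketched in Section 3.1
        self.plain_conv = nn.Conv3d(in_ch, out_ch, 3, padding=1)
        self.down_a = nn.Conv3d(out_ch, out_ch, 2, stride=2)  # assumed strided-conv downsampling
        self.down_b = nn.Conv3d(out_ch, out_ch, 2, stride=2)
        self.psa = PSABlock()                                 # sketched in Section 3.2
        self.fuse = nn.Conv3d(2 * out_ch, out_ch, kernel_size=1)

    def forward(self, x):
        plain = self.plain_conv(x)                   # regular features, kept for the skip connection
        a = self.down_a(self.sca_conv(x))            # channel-attentive path
        b = self.psa(self.down_b(plain))             # spatially attentive path
        out = self.fuse(torch.cat([a, b], dim=1))    # merge both attention types
        return out, plain
```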

3.3. Triple Consistency Training

Our initial decoder is similar to that of 3D V-Net, performing upsampling through transposed convolution. We then added two additional, slightly different decoders to the original one as auxiliary classifiers. One decoder replaces the transposed-convolution upsampling with trilinear interpolation followed by convolution. The other extends the transferred feature maps using nearest-neighbor interpolation and then performs regular 3D convolution for upsampling. All three decoders share the same encoder. We use these three slightly different decoders to approximate the model's uncertainty in recognizing certain areas. The three decoders receive the same deep features $F_e$ from the encoder and generate three feature sets $F_A$, $F_B$, and $F_C$. Applying the sigmoid activation function to these features yields the probability outputs $P_A$, $P_B$, and $P_C$, respectively. Employing three different decoders in this way enhances the diversity of the segmentation model and effectively reduces overfitting among the different sub-models. Compared to Monte Carlo dropout [40,41], the sub-models of BEAP-Net are fixed and require no additional perturbation during training.
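As an illustration of how the three slightly different upsampling heads can be built, the sketch below follows the description above; channel sizes and layer composition are placeholders.

```python
import torch.nn as nn

def make_upsampler(kind: str, in_ch: int, out_ch: int) -> nn.Module:
    """One upsampling step per decoder variant, as described above."""
    if kind == "transpose":   # decoder A: transposed convolution
        return nn.ConvTranspose3d(in_ch, out_ch, kernel_size=2, stride=2)
    if kind == "trilinear":   # decoder B: trilinear interpolation + convolution
        return nn.Sequential(
            nn.Upsample(scale_factor=2, mode="trilinear", align_corners=False),
            nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1))
    if kind == "nearest":     # decoder C: nearest-neighbor interpolation + convolution
        return nn.Sequential(
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1))
    raise ValueError(f"unknown decoder kind: {kind}")
```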
We convert the prediction biases of the three decoders into auxiliary supervision signals to facilitate model training through a cyclic pseudo-labeling scheme. First, we use a sharpening function [42] to convert the probability outputs $P_A$, $P_B$, and $P_C$ into soft pseudo-labels $sPL_A$, $sPL_B$, and $sPL_C \in [0, 1]^{H \times W \times D}$. The sharpening function is defined as
$$sPL = \frac{P^{1/T}}{P^{1/T} + (1 - P)^{1/T}}$$
where T is a constant that controls the sharpening temperature. Using soft pseudo-labels not only aids entropy-regularized training [24]; compared with pseudo-labels generated by fixed thresholds, soft pseudo-labels also eliminate the influence of mislabeled training data [42]. Then, during training, the outputs $sPL_A$, $sPL_B$, and $sPL_C$ supervise one another, aiming to maintain mutual consistency for the same input [43]. In this way, the predictions $P_A$, $P_B$, and $P_C$ are made consistent and low-entropy. This consistency and entropy regularization enables the model to focus more on unlabeled and uncertain regions [23,44]. The total training loss of BEAP-Net is a weighted sum of the segmentation loss $L_{seg}$ and the consistency loss $L_c$, as follows:
$$L_{seg} = \left[\mathrm{Dice}(P_A, Y) + \mathrm{Dice}(P_B, Y) + \mathrm{Dice}(P_C, Y)\right] / 3$$
$$L_c = \left[L_2(P_A, sPL_B) + L_2(P_A, sPL_C) + L_2(P_B, sPL_A) + L_2(P_B, sPL_C) + L_2(P_C, sPL_A) + L_2(P_C, sPL_B)\right] / 6$$
$$\mathrm{Total\ loss:}\quad \mathrm{Loss} = L_{seg} + \lambda \times L_c$$
where Dice denotes the Dice loss, $L_2$ is the mean squared error (MSE) loss, and Y is the ground truth. The weight λ balances $L_{seg}$ and $L_c$. Note that $L_{seg}$ is computed solely on annotated data, while $L_c$ supervises all training data. We combine the concepts of consistency training and entropy regularization, using a consistency loss to constrain the sub-models to produce consistent outputs for the same input. The predictions from the three different decoders are used, and their discrepancies are transformed into auxiliary supervisory signals to enhance model training. We expect the sub-models to learn more latent information from the differences between them [45], thereby achieving a performance improvement.
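The sharpening function and the combined loss above can be expressed directly in PyTorch. In this sketch, `dice_loss` stands for any standard soft-Dice implementation, and detaching the soft pseudo-labels is our assumption about how the mutual supervision is applied.

```python
import torch
import torch.nn.functional as F

def sharpen(p: torch.Tensor, T: float = 0.1) -> torch.Tensor:
    """Temperature sharpening: sPL = P^(1/T) / (P^(1/T) + (1-P)^(1/T))."""
    pt = p ** (1.0 / T)
    return pt / (pt + (1.0 - p) ** (1.0 / T))

def beap_loss(pa, pb, pc, y, lam, dice_loss):
    """Total loss sketch: mean Dice on labeled voxels plus pairwise MSE
    consistency against the other decoders' sharpened soft pseudo-labels."""
    l_seg = (dice_loss(pa, y) + dice_loss(pb, y) + dice_loss(pc, y)) / 3
    spl_a = sharpen(pa).detach()   # treat soft pseudo-labels as fixed targets (assumed)
    spl_b = sharpen(pb).detach()
    spl_c = sharpen(pc).detach()
    l_c = (F.mse_loss(pa, spl_b) + F.mse_loss(pa, spl_c) +
           F.mse_loss(pb, spl_a) + F.mse_loss(pb, spl_c) +
           F.mse_loss(pc, spl_a) + F.mse_loss(pc, spl_b)) / 6
    return l_seg + lam * l_c
```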

4. Experiment

4.1. Implementation Details

We evaluated the proposed BEAP-Net on the LA database from the 2018 Left Atrium Segmentation Challenge [46]. This dataset consists of 100 labeled 3D gadolinium-enhanced left atrial magnetic resonance images. We used the first 80 samples for training and the remaining 20 for validation and testing, strictly following the experimental settings in MC-Net [12]. In the preprocessing stage, we first obtain 3D MR images with enlarged margins and then crop them to the required target size, after which we normalize them to zero mean and unit variance. During training, we randomly crop the resulting images to patches of size 112 × 112 × 80 and apply 2D random rotation and flipping for data augmentation. We set the batch size to 4, with each batch containing two labeled and two unlabeled images. The temperature constant T is set to 0.1, and the weight λ follows a time-dependent Gaussian warm-up function [40]. BEAP-Net is trained with the SGD optimizer for 15K iterations, with an initial learning rate of 0.01 that decays by 10% every 2.5K iterations. During testing, we extract features using a sliding window with a fixed stride of 18 × 18 × 4 and reassemble the predictions of all patches into a complete result. Finally, we use the average of the outputs of decoders A and B as the final output at test time. All experiments in this paper were conducted with fixed random seeds in the same environment (hardware: Intel Core i7-13700HX CPU, NVIDIA GeForce RTX 4080 GPU; software: PyTorch 1.13.1+cu117 and Python 3.9.16).
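The time-dependent Gaussian warm-up for λ is commonly implemented as below; the maximum weight and ramp length are assumptions, in the style of [40].

```python
import math

def consistency_weight(step: int, max_step: int, w_max: float = 1.0) -> float:
    """Gaussian warm-up: lambda ramps from ~0 to w_max as training proceeds."""
    t = min(step, max_step) / max_step
    return w_max * math.exp(-5.0 * (1.0 - t) ** 2)
```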

4.2. Results and Comparison with SOTA

We compared the proposed BEAP-Net with the following competitors on the LA database: UA-MT [43], SASSNet [47], DTC [7], DAP [48], LG-ER-MT [16], BCP [49], MC-Net [12], and DUWM [50]. For fairness, the experimental setup is the same as that of recent methods such as MC-Net: all methods, including ours, use the same data augmentation for all images, with identical parameter settings. We used the first 80 samples in order for training and the remaining 20 samples for validation. We trained the model with 10% and 20% of the labeled data, meaning that of the 80 training images, 8 (10%) or 16 (20%) have labels while the rest are unlabeled. Testing used the last 20 of the 100 LA images. Figure 6 shows the segmentation results of the original image with SASSNet [47], MC-Net [12], BCP [49], and our BEAP-Net, with the corresponding ground truth on the LA database in the last column as a control. As shown in the figure, in both the 3D and 2D views, our model generates a more complete left atrium than all existing SOTA methods. Note that we did not apply any post-processing, such as morphological algorithms, to the resulting maps.
We evaluate model performance using four metrics: Dice Similarity Coefficient (DSC), Jaccard Index, 95% Hausdorff Distance (HD95), and Average Symmetric Distance (ASD). Results for V-Net under various supervision settings (10%, 20%, and all labeled data) are also provided for reference. As shown in Table 1, our method achieves the best performance on all four evaluation metrics, significantly outperforming the eight recent state-of-the-art (SOTA) methods. Moreover, our method achieves an excellent Dice score of 91.52% (0.96% higher than the second best) using only 20% labeled data, thereby establishing new state-of-the-art segmentation performance on the LA database. To ensure a fair comparison, these results were not post-processed. The experiments demonstrate that BEAP-Net can extract more spatial and channel-wise information and thus achieve more precise segmentation. The proposed algorithm also has a runtime similar to the compared algorithms: for example, when training on 80 three-dimensional images with 20% labeled, a batch size of 4, and 15,000 iterations, BEAP-Net took 17 h, only one hour more than BCP, while achieving a significant performance enhancement. Additionally, as shown in Figure 6, the proposed method segments the target organs more accurately and finely, especially in areas prone to misidentification.

4.3. Ablation Study

We divide our approach into three parts and investigate the role of each part on the performance of network segmentation.

4.3.1. Effectiveness of Supreme Channel Attention

As shown in Figure 7, to validate the effectiveness of the proposed SCA module, we replaced the convolution blocks in the first and second layers of the V-Net encoder with the SCA module, keeping the rest of the architecture identical to the original V-Net. We then evaluated it alongside the original V-Net on the LA database described above. All experimental settings were identical to those in Table 1, and the results are shown in Table 2. Because our primary focus is semi-supervised learning, we conducted comparative experiments using only 10% and 20% labeled data, without a fully supervised comparison at 100% labels. Additionally, we incorporated the SE channel attention mechanism [31] into the original V-Net as a further comparison, to demonstrate the particular suitability of our SCA mechanism for 3D images. The results show that the network augmented with our SCA module consistently achieved superior performance in every supervision setting. In particular, under the 20% supervision setting, the network with our SCA module attained a Dice score of 91.33%, very close to the 91.47% achieved by V-Net under full supervision. The experiments demonstrate that our SCA module indeed helps the network extract more channel-specific information and achieves a significant performance improvement.

4.3.2. Effectiveness of Parallel Spatial Attention

We also conducted ablation experiments on the PSA module using a similar approach. As shown in Figure 8, we appended the PSA module to the end of each layer of the V-Net encoder. We compared this network with the original V-Net under the 10% and 20% supervision settings, using V-Net with 100% labeled data as the reference. Similarly, we placed a regular spatial attention module in the same position as the PSA module to demonstrate the effectiveness of our approach by comparison. All experimental settings were identical to those in the previous ablation experiments.
As shown in Table 3, the network with the PSA module achieved the best performance across all supervision settings. The experiments demonstrated that the proposed PSA module can extract more detailed spatial information and enable more efficient training of the network.

4.3.3. Ablation Study of Triple Consistency Training

Yicheng Wu et al. [12] demonstrated that, in a semi-supervised setting, using two slightly different decoders (V2d-Net) yields better results than using two identical decoders (V2-Net). Building upon this, we conducted further experiments, as shown in Table 4. We implemented three different decoders on V-Net (V3d-Net), applied the first two of our three decoders (V2d-Net), and replaced our three different decoders with three instances of the original first decoder (V3-Net). Like our triple-decoder consistency-trained network, these variants also encourage mutual training among their sub-models, with all other configurations kept identical. Experiments were conducted on the LA database with all experimental parameters held consistent and 15,000 iterations of training. The time required for V2d-Net, V3-Net, and V3d-Net was 13 h, 14.5 h, and 15 h, respectively. As shown in Table 4, using three slightly different decoders (V3d-Net) yields better performance than using two slightly different decoders (V2d-Net) or three identical decoders (V3-Net). This indicates that the configuration of three slightly different decoders, encouraged to train consistently and learn from each other, indeed captures more effective information despite a slight increase in training time. This approach helps the model make more accurate judgments in areas where recognition is unclear.

4.4. Robustness and Generalization

To further validate the robustness and generalization performance of the proposed BEAP-Net model, we conducted experiments on the Pancreas-NIH database [51]. The Pancreas-NIH dataset consists of 80 contrast-enhanced abdominal CT volumes. For a fair comparison, we followed the settings in CoraNet [52], employing data augmentation techniques such as random rotation and flipping. During training, images were randomly cropped to a size of 96 × 96 × 96 before being input into the network. Two experimental settings were designed with 10% and 20% labeled data. The batch size, initial learning rate, and number of iterations were set to 4, 0.01, and 15,000, respectively, with a learning rate decay of 10% every 2500 iterations. The proposed network was compared with several state-of-the-art algorithms in Table 5. All networks were tested under the same experimental conditions, and the results were reported without any post-processing. In the 20% labeled examples setting, the time taken by DTC, MC-Net, and BEAP-Net was 14 h, 16 h, and 16.5 h, respectively. As shown in Table 5, BEAP-Net achieved the highest Dice score under all supervision settings, demonstrating that BEAP-Net remains effective on other datasets and outperforms several top algorithms.

5. Discussion

Previously, we attempted to place the PSA module before the final feature block extracted at each layer of the encoder, as depicted in Figure 9a; however, the experimental results were not satisfactory. We hypothesized that concatenating spatial and channel information in the feature block might blur which information is needed, causing information disorder in the network. Subsequently, we adopted a dual-path approach in which regular convolutions and SCA convolutions run in parallel and their outputs are concatenated ("cat"), followed by convolution along the channel dimension. The resulting feature is then processed through the PSA operation (as shown in Figure 9b). In other words, the PSA module is placed at the end of each layer of the network, and its outputs serve as inputs to the next layer. Indeed, this approach yielded much better results than before.
The final structure of BEAP-Net is shown in Figure 1. We also considered applying the same network structure as in the first layer to the second, third, and fourth layers of the network. As shown in Figure 10, we attempted to apply this structure only in the first layer of the network (BEAP-Net), in the first and second layers of the network (try1), and in the first, second, and third layers of the network (try2).
We evaluated these three approaches on the LA database. All experimental parameters remained consistent, and under 15,000 iterations the three network architectures consumed approximately the same time, around 16 h. The detailed results are presented in Table 6. Although all three settings showed significant performance improvements over the non-improved version, applying the structure to the first two or three layers simultaneously did not yield better results than applying it only to the first layer. The best experimental results were achieved with our structure in the first layer alone, demonstrating that it can indeed extract more useful spatial and channel-related information. When the structure is applied to the second or third layers as well, the network should theoretically extract even more useful information, yet the experimental results deteriorated. This suggests that the additional extracted information did not contribute positively, or may even have interfered with the information extracted by the first layer. We believe that with too many parameters the network may struggle to learn effectively: perhaps it has to learn too much, making it unclear which information is most important. In conclusion, through various attempts, the network achieved the best segmentation results when this structure was applied only to the first layer, as depicted in Figure 1, which constitutes our final BEAP-Net.
Of course, the proposed algorithm also has some limitations. First, the algorithm is currently most suited for 3D atrial MRI images. Its performance may not be as effective when applied to other areas, such as lung images, or in more blurred images like ultrasound. Second, the proposed method still struggles to achieve the level of fully supervised methods and is not yet suitable for practical use.

6. Conclusions

This paper introduces a Bidirectional Efficient Attention Parallel Network (BEAP-Net). It incorporates channel and spatial attention mechanisms specifically designed for 3D medical imaging: SCA and PSA. The experimental results demonstrate that the proposed PSA module effectively and comprehensively extracts spatial feature information, aiding the network in making more accurate segmentation decisions. The use of multiple decoders allows sub-models to learn from each other, enabling BEAP-Net to achieve more precise segmentation predictions in ambiguous regions. Notably, the proposed Supreme Channel Attention mechanism not only extracts a broader range of channel features but also addresses the inaccuracies caused by spatial gaps between slices in previous 3D medical image attention networks. This results in a significant improvement in network performance. BEAP-Net outperforms eight current popular algorithms and achieves the new state-of-the-art performance on both the LA dataset and the pancreatic CT dataset, demonstrating its ability to effectively utilize spatial and channel information. It also mitigates the errors introduced by other attention mechanisms due to the unaccounted spatial gaps between adjacent slices, making it particularly suitable for 3D medical datasets. Importantly, BEAP-Net can be applied to other 3D medical datasets with excellent results. Additionally, the proposed SCA and PSA modules can be seamlessly integrated into other 3D medical image segmentation networks to provide significant performance gains. The effectiveness of the SCA module, in particular, has been validated through ablation studies.

Author Contributions

Conceptualization, D.W.; methodology, T.X.; software, L.Y.; validation, J.G.; formal analysis, D.W.; investigation, J.L. (Jianshen Li); resources, J.L. (Jiehui Liu); data curation, J.L. (Jiehui Liu); writing—original draft preparation, T.X.; writing—review and editing, L.Y.; visualization, J.G.; supervision, D.W.; data analysis, D.W.; encoding and experiments, T.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data are not publicly available at this time. Upon acceptance of the article, we will provide a link to access our data and relevant code.

Acknowledgments

We are grateful to our faculty mentors. This work was supported in part by a grant from the Hebei University of Engineering.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
  2. Milletari, F.; Navab, N.; Ahmadi, S.-A. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; pp. 565–571. [Google Scholar]
  3. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
  4. Ganin, Y.; Lempitsky, V. Unsupervised domain adaptation by backpropagation. In Proceedings of the International Conference on Machine Learning, PMLR, Montreal, QC, Canada, 11–12 December 2015; pp. 1180–1189. [Google Scholar]
  5. Yang, J.; Dvornek, N.C.; Zhang, F.; Chapiro, J.; Lin, M.; Duncan, J.S. Unsupervised domain adaptation via disentangled representations: Application to cross-modality liver segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Shenzhen, China, 13–17 October 2019; Springer: Berlin/Heidelberg, Germany, 2019; pp. 255–263. [Google Scholar]
  6. Zhao, Z.; Xu, K.; Li, S.; Zeng, Z.; Guan, C. Mt-uda: Towards unsupervised cross-modality medical image segmentation with limited source labels. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Strasbourg, France, 27 September–1 October 2021; Springer: Berlin/Heidelberg, Germany, 2021; pp. 293–303. [Google Scholar]
  7. Luo, X.; Chen, J.; Song, T.; Wang, G. Semi-supervised medical image segmentation through dual-task consistency. Proc. AAAI Conf. Artif. Intell. 2021, 35, 8801–8809. [Google Scholar] [CrossRef]
  8. Luo, X.; Hu, M.; Song, T.; Wang, G.; Zhang, S. Semi-supervised medical image segmentation via cross teaching between cnn and transformer. In Proceedings of the International Conference on Medical Imaging with Deep Learning, Lübeck, Germany, 7–9 July 2021. [Google Scholar]
  9. Bai, W.; Oktay, O.; Sinclair, M.; Suzuki, H.; Rajchl, M.; Tarroni, G.; Glocker, B.; King, A.; Matthews, P.M.; Rueckert, D. Semisupervised learning for network-based cardiac mr image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Quebec City, QC, Canada, 10–14 September 2017; Springer: Berlin/Heidelberg, Germany, 2017; pp. 253–260. [Google Scholar]
  10. Basak, H.; Ghosal, S.; Sarkar, R. Addressing class imbalance in semi-supervised image segmentation: A study on cardiac mri. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Vancouver, BC, Canada, 8–12 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 224–233. [Google Scholar]
  11. You, C.; Zhou, Y.; Zhao, R.; Staib, L.; Duncan, J.S. SimCVD: Simple contrastive voxel-wise representation distillation for semi-supervised medical image segmentation. IEEE Trans. Med. Imaging 2022, 41, 2228–2237. [Google Scholar] [CrossRef] [PubMed]
  12. Wu, Y.; Xu, M.; Ge, Z.; Cai, J.; Zhang, L. Semi-supervised left atrium segmentation with mutual consistency training. In Medical Image Computing and Computer Assisted Intervention, Proceedings of the MICCAI 2021 24th International Conference, Strasbourg, France, 27 September–1 October 2021; Proceedings, Part II, volume 12902 of Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2021; pp. 297–306. [Google Scholar]
  13. Lyu, F.; Ye, M.; Carlsen, J.F.; Erleben, K.; Darkner, S.; Yuen, P.C. Pseudo-label guided image synthesis for semi-supervised covid-19 pneumonia infection segmentation. IEEE Trans. Med. Imaging 2022, 42, 797–809. [Google Scholar] [CrossRef] [PubMed]
  14. Seibold, C.M.; Reiß, S.; Kleesiek, J.; Stiefelhagen, R. Reference-guided pseudo-label generation for medical semantic segmentation. Proc. AAAI Conf. Artif. Intell. 2022, 36, 2171–2179. [Google Scholar] [CrossRef]
  15. Jin, Q.; Cui, H.; Sun, C.; Zheng, J.; Wei, L.; Fang, Z.; Meng, Z.; Su, R. Semisupervised histological image segmentation via hierarchical consistency enforcement. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Vancouver, BC, Canada, 8–12 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 3–13. [Google Scholar]
  16. Hang, W.; Feng, W.; Liang, S.; Yu, L.; Wang, Q.; Choi, K.-S.; Qin, J. Local and global structure-aware entropy regularized mean teacher model for 3d left atrium segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Lima, Peru, 4–8 October 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 562–571. [Google Scholar]
  17. Vesal, S.; Gu, M.; Kosti, R.; Maier, A.; Ravikumar, N. Adapt everywhere: Unsupervised adaptation of point-clouds and entropy minimization for multi-modal cardiac image segmentation. IEEE Trans. Med. Imaging 2021, 40, 1838–1851. [Google Scholar] [CrossRef] [PubMed]
  18. Zhou, Y.; Wang, Y.; Tang, P.; Bai, S.; Shen, W.; Fishman, E.; Yuille, A. Semi-supervised 3d abdominal multi-organ segmentation via deep multi-planar co-training. In Proceedings of the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa Village, HI, USA, 7–11 January 2019; pp. 121–140. [Google Scholar]
  19. Wang, G.; Zhai, S.; Lasio, G.; Zhang, B.; Yi, B.; Chen, S.; Macvittie, T.J.; Metaxas, D.; Zhou, J.; Zhang, S. Semi-supervised segmentation of radiation-induced pulmonary fibrosis from lung ct scans with multi-scale guided dense attention. IEEE Trans. Med. Imaging 2022, 41, 531–542. [Google Scholar] [CrossRef] [PubMed]
  20. Fan, D.-P.; Zhou, T.; Ji, G.-P.; Zhou, Y.; Chen, G.; Fu, H.; Shen, J.; Shao, L. Inf-Net: Automatic covid-19 lung infection segmentation from ct images. IEEE Trans. Med. Imaging 2020, 39, 2626–2637. [Google Scholar] [CrossRef] [PubMed]
  21. French, G.; Laine, S.; Aila, T.; Mackiewicz, M.; Finlayson, G. Semi-supervised semantic segmentation needs strong, varied perturbations. arXiv 2019, arXiv:1906.01916. [Google Scholar]
  22. Ouali, Y.; Hudelot, C.; Tami, M. Semisupervised semantic segmentation with cross-consistency training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 12674–12684. [Google Scholar]
  23. Sohn, K.; Berthelot, D.; Li, C.-L.; Zhang, Z.; Carlini, N.; Cubuk, E.D.; Kurakin, A.; Zhang, H.; Raffel, C. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. In Proceedings of the Advances in Neural Information Processing Systems 33 (NeurIPS 2020), Online, 6–12 December 2020; Volume 33, pp. 596–608. [Google Scholar]
  24. Lee, D.H. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Proceedings of the ICML 2013, Atlanta, GA, USA, 16–21 June 2013; Volume 3. [Google Scholar]
  25. Tarvainen, A.; Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Adv. Neural Inf. Process. Syst. 2017, 30, 1195–1204. [Google Scholar]
  26. Xie, Q.; Luong, M.T.; Hovy, E.; Le, Q.V. Self-training with noisy student improves imagenet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10687–10698. [Google Scholar]
  27. Wang, K.; Zhan, B.; Zu, C.; Wu, X.; Zhou, J.; Zhou, L.; Wang, Y. Tripled-uncertainty guided mean teacher model for semi-supervised medical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Strasbourg, France, 27 September–1 October 2021; Springer: Berlin/Heidelberg, Germany, 2021; pp. 450–460. [Google Scholar]
  28. Mnih, V.; Heess, N.; Graves, A.; Kavukcuoglu, K. Recurrent Models of Visual Attention. In Advances in Neural Information Processing Systems (NIPS); Curran Associates, Inc.: Red Hook, NY, USA, 2014; Volume 27, pp. 2204–2212. [Google Scholar]
  29. Ba, J.; Mnih, V.; Kavukcuoglu, K. Multiple object recognition with visual attention. arXiv 2014, arXiv:1406.5679. [Google Scholar]
  30. Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Zemel, R.; Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 2048–2057. [Google Scholar]
  31. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  32. Wang, Q.; Wu, B.; Wu, X.; Qiao, Y. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542. [Google Scholar]
  33. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional block attention module. Proc. Eur. Conf. Comput. Vis. (ECCV) 2018, 3, 3–19. [Google Scholar]
  34. Berthelot, D.; Carlini, N.; Goodfellow, I.; Papernot, N.; Oliver, A.; Raffel, C.A. Mixmatch: A holistic approach to semi-supervised learning. Adv. Neural Inf. Process. Syst. 2019, 32, 5049–5505. [Google Scholar]
  35. Xie, S.; Girshick, R.B.; Dollár, P.; Tu, Z.; He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5987–5995. [Google Scholar]
  36. Chen, Y.; Kalantidis, Y.; Li, J.; Yan, S.; Feng, J. A2-Nets: Double attention networks. In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), Montreal, QC, Canada, 3–8 December 2018. [Google Scholar]
  37. Gao, Z.; Xie, J.; Wang, Q.; Li, P. Global second-order pooling convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3242–3251. [Google Scholar]
  38. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 5206–5215. [Google Scholar]
  39. Zagoruyko, S.; Komodakis, N. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017. [Google Scholar]
  40. Yu, L.; Wang, S.; Li, X.; Fu, C.W.; Heng, P.A. Uncertainty-Aware Self-Ensembling Model for Semi-Supervised 3D Left Atrium Segmentation. In Proceedings of the Medical Image Computing and Computer Assisted Intervention–MICCAI 2019: 22nd International Conference, Shenzhen, China, 13–17 October 2019. [Google Scholar]
  41. Kendall, A.; Gal, Y. What uncertainties do we need in bayesian deep learning for computer vision? arXiv 2017, arXiv:1703.04977. [Google Scholar]
  42. Xie, Q.; Dai, Z.; Hovy, E.; Luong, M.T.; Le, Q.V. Unsupervised data augmentation for consistency training. arXiv 2019, arXiv:1904.12848. [Google Scholar]
  43. Zhang, Y.; Xiang, T.; Hospedales, T.M.; Lu, H. Deep Mutual Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4320–4328. [Google Scholar]
  44. Xia, Y.; Liu, F.; Yang, D.; Cai, J.; Yu, L.; Zhu, Z.; Xu, D.; Yuille, A.; Roth, H. 3D semi-supervised learning with uncertainty-aware multi-view cotraining. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 3646–3655. [Google Scholar]
  45. Xiong, Z.; Xia, Q.; Hu, Z.; Huang, N.; Bian, C.; Zheng, Y.; Vesal, S.; Ravikumar, N.; Maier, A.; Yang, X.; et al. A global benchmark of algorithms for segmenting the left atrium from late gadolinium-enhanced cardiac magnetic resonance imaging. Med. Image Anal. 2021, 67, 101832. [Google Scholar] [CrossRef] [PubMed]
  46. Laine, S.; Aila, T. Temporal ensembling for semi-supervised learning. arXiv 2016, arXiv:1610.02242. [Google Scholar]
  47. Li, S.; Zhang, C.; He, X. Shape-Aware Semi-Supervised 3D Semantic Segmentation for Medical Images. In Proceedings of the Medical Image Computing and Computer Assisted Intervention–MICCAI 2020: 23rd International Conference, Lima, Peru, 4–8 October 2020; Martel, A.L., Ed.; Springer: Cham, Switzerland, 2020; Volume 12261, pp. 455–463. [Google Scholar]
  48. Zheng, H.; Lin, L.; Hu, H. Semi-supervised segmentation of liver using adversarial learning with deep atlas prior. In Proceedings of the Medical Image Computing and Computer Assisted Intervention—MICCAI 2019: 22nd International Conference, Shenzhen, China, 13–17 October 2019; Shen, D., Ed.; Springer: Cham, Switzerland, 2019; pp. 148–156. [Google Scholar] [CrossRef]
  49. Bai, Y.; Chen, D.; Li, Q.; Shen, W.; Wang, Y. Bidirectional copy-paste for semi-supervised medical image segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 1234–1244. [Google Scholar]
  50. Wang, Y.; Zhang, Y.; Tian, J.; Zhong, C.; Shi, Z.; Zhang, Y. Double-uncertainty weighted method for semi-supervised learning. In Proceedings of the Medical Image Computing and Computer Assisted Intervention–MICCAI 2020: 23rd International Conference, Lima, Peru, 4–8 October 2020; Martel, A.L., Ed.; Springer: Berlin/Heidelberg, Germany; Cham, Switzerland, 2020; pp. 542–551. [Google Scholar] [CrossRef]
  51. Roth, H.R.; Lu, L.; Farag, A.; Shin, H.-C.; Liu, J.; Turkbey, E.B.; Summers, R.M. Deeporgan: Multi-level deep convolutional networks for automated pancreas segmentation. In Medical Image Computing and Computer-Assisted Intervention, Proceedings of the MICCAI 2015 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part I, volume 9349 of Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2015; pp. 556–564. [Google Scholar]
  52. Shi, Y.; Zhang, J.; Ling, T.; Lu, J.; Zheng, Y.; Yu, Q.; Qi, L.; Gao, Y. Inconsistency-aware uncertainty estimation for semi-supervised medical image segmentation. IEEE Trans. Med. Imaging 2022, 41, 608–620. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Network architecture of BEAP-Net.
Figure 2. Schematic diagram of the Efficient Channel Attention (ECA) module.
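For readers who want a concrete reference point for the ECA module in Figure 2, the sketch below follows the published ECA design [31], adapted to 3D feature maps. This is a minimal illustrative version, not the authors' released code; the class name is ours, and the adaptive kernel-size rule (k derived from the channel count with γ = 2, b = 1) is taken from the original ECA-Net paper.

```python
import math
import torch
import torch.nn as nn

class ECA3D(nn.Module):
    """Minimal sketch of Efficient Channel Attention (ECA) for 3D volumes.

    Following Wang et al. [31]: global average pooling produces a channel
    descriptor, a cheap 1D convolution models local cross-channel
    interaction, and a sigmoid yields per-channel weights. The kernel
    size k is chosen adaptively from the channel count C.
    """

    def __init__(self, channels: int, gamma: int = 2, b: int = 1):
        super().__init__()
        t = int(abs((math.log2(channels) + b) / gamma))
        k = t if t % 2 else t + 1  # force an odd kernel size
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, D, H, W) -> squeeze spatial dims to a channel descriptor
        y = x.mean(dim=(2, 3, 4))                  # (N, C)
        y = self.conv(y.unsqueeze(1)).squeeze(1)   # 1D conv across channels
        w = torch.sigmoid(y).view(x.size(0), -1, 1, 1, 1)
        return x * w                               # reweight channels
```

For example, `ECA3D(64)(torch.randn(2, 64, 8, 16, 16))` returns a tensor of the same shape with channels reweighted; unlike SE-style blocks, no fully connected layers or dimensionality reduction are involved.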
Figure 3. Schematic diagram of the Supreme Channel Attention (SCA) module.
Figure 4. SCAConvBlock.
Figure 5. Schematic diagram of the Parallel Spatial Attention (PSA) module.
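The exact PSA design is defined in the paper's method section; as a point of reference for Figure 5, the sketch below implements the standard CBAM-style spatial attention [32] extended to 3D, which corresponds to the single-branch V-Net+SA baseline in Table 3 rather than to PSA itself. All names here are illustrative.

```python
import torch
import torch.nn as nn

class SpatialAttention3D(nn.Module):
    """CBAM-style spatial attention [32] for 3D volumes (illustrative
    baseline; not the paper's PSA module).

    Average- and max-pooling along the channel axis give two spatial
    descriptors, which a single convolution fuses into a voxel-wise
    attention map.
    """

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv3d(2, 1, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = x.mean(dim=1, keepdim=True)     # (N, 1, D, H, W)
        mx, _ = x.max(dim=1, keepdim=True)    # (N, 1, D, H, W)
        att = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * att                        # reweight voxel positions
```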
Figure 6. The first column shows the original image; columns 2–5 show the segmentation results of SASSNet, MC-Net, BCP, and BEAP-Net on the LA database; and the last column shows the ground truth. The upper and lower parts of the figure correspond to the 10% and 20% labeling settings, respectively.
Figure 7. V-Net+SCA (encoder part).
Figure 8. The ablation study model of the PSA module.
Figure 9. Two placement strategies of the PSA module: (a) and (b).
Figure 10. Applying the first layer's architecture to the first two layers of the network (try1) and to the first three layers of the network (try2).
Table 1. Comparisons with eight state-of-the-art methods on the LA database.

| Method | Labeled | Dice ↑ | Jaccard ↑ | HD95 ↓ (Voxel) | ASD ↓ (Voxel) |
|---|---|---|---|---|---|
| V-Net | 8 (10%) | 79.99 | 68.12 | 21.11 | 5.48 |
| V-Net | 16 (20%) | 86.10 | 76.09 | 13.29 | 3.19 |
| V-Net | 80 (All) | 91.14 | 83.82 | 5.75 | 1.52 |
| DAP [48] | 8 (10%) | 81.88 | 71.23 | 15.81 | 3.79 |
| UA-MT [40] | 8 (10%) | 84.23 | 73.48 | 13.84 | 3.35 |
| SASSNet [47] | 8 (10%) | 87.31 | 77.72 | 9.61 | 2.55 |
| LG-ER-MT [16] | 8 (10%) | 85.53 | 75.12 | 13.29 | 3.77 |
| DUWM [50] | 8 (10%) | 85.90 | 75.74 | 12.67 | 3.31 |
| DTC [7] | 8 (10%) | 86.57 | 76.55 | 14.47 | 3.74 |
| MC-Net [12] | 8 (10%) | 87.70 | 78.30 | 9.37 | 2.18 |
| BEAP-Net (Ours) | 8 (10%) | 88.95 ↑ 1.25 | 80.18 ↑ 1.88 | 8.90 ↓ 0.47 | 1.94 ↓ 0.24 |
| DAP [48] | 16 (20%) | 87.89 | 78.72 | 9.29 | 2.74 |
| UA-MT [40] | 16 (20%) | 88.87 | 80.21 | 7.32 | 2.26 |
| SASSNet [47] | 16 (20%) | 89.54 | 81.24 | 8.24 | 2.20 |
| LG-ER-MT [16] | 16 (20%) | 89.63 | 81.30 | 7.16 | 2.06 |
| DUWM [50] | 16 (20%) | 89.65 | 81.33 | 7.04 | 2.03 |
| DTC [7] | 16 (20%) | 89.42 | 80.97 | 7.32 | 2.10 |
| BCP [49] | 16 (20%) | 90.56 | 82.69 | 6.03 | 1.71 |
| MC-Net [12] | 16 (20%) | 90.34 | 82.50 | 6.01 | 1.75 |
| BEAP-Net (Ours) | 16 (20%) | 91.52 ↑ 0.96 | 84.41 ↑ 1.72 | 5.08 ↓ 0.93 | 1.39 ↓ 0.32 |
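The four metrics reported throughout the tables are the standard ones for volumetric segmentation. As a reference, a minimal sketch of the two overlap metrics (Dice and Jaccard) on binary volumes is given below; the two surface metrics (HD95 and ASD) are typically computed with a surface-distance library such as medpy (`medpy.metric.binary.hd95` and `medpy.metric.binary.asd`). The function name here is ours.

```python
import numpy as np

def dice_jaccard(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8):
    """Dice and Jaccard scores for binary segmentation volumes.

    pred, gt: {0, 1} arrays of identical shape, e.g. (D, H, W).
    """
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    dice = 2.0 * inter / (pred.sum() + gt.sum() + eps)
    jaccard = inter / (np.logical_or(pred, gt).sum() + eps)
    return dice, jaccard
```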
Table 2. Ablation study experiments of the SCA module.

| Method | Labeled | Dice ↑ | Jaccard ↑ | HD95 ↓ (Voxel) | ASD ↓ (Voxel) |
|---|---|---|---|---|---|
| V-Net | 8 (10%) | 82.72 | 71.73 | 13.33 | 3.27 |
| V-Net | 16 (20%) | 86.10 | 76.09 | 13.29 | 3.19 |
| V-Net | 80 (All) | 91.47 | 84.35 | 5.49 | 1.52 |
| V-Net+SE | 8 (10%) | 86.04 | 76.06 | 14.25 | 3.51 |
| V-Net+SE | 16 (20%) | 90.72 | 82.74 | 7.66 | 2.36 |
| V-Net+SCA | 8 (10%) | 90.69 | 82.56 | 7.78 | 2.42 |
| V-Net+SCA | 16 (20%) | 91.33 | 83.79 | 5.78 | 1.57 |
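V-Net+SE in Table 2 denotes V-Net with squeeze-and-excitation blocks [34]. For completeness, a minimal 3D SE block is sketched below; the reduction ratio r = 16 is the common default from the SE paper and is an assumption here, not a value restated in this article.

```python
import torch
import torch.nn as nn

class SE3D(nn.Module):
    """Minimal squeeze-and-excitation block [34] for 3D feature maps."""

    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        hidden = max(channels // r, 1)
        self.fc = nn.Sequential(
            nn.Linear(channels, hidden),   # squeeze: reduce channel descriptor
            nn.ReLU(inplace=True),
            nn.Linear(hidden, channels),   # excite: restore channel dimension
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.fc(x.mean(dim=(2, 3, 4)))          # (N, C) channel weights
        return x * w.view(x.size(0), -1, 1, 1, 1)   # rescale channels
```

In contrast to the parameter-light 1D convolution in the ECA sketch above, SE uses two fully connected layers with channel reduction, which is the main design difference the ablation probes.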
Table 3. Ablation study of the PSA module.

| Method | Labeled | Dice ↑ | Jaccard ↑ | HD95 ↓ (Voxel) | ASD ↓ (Voxel) |
|---|---|---|---|---|---|
| V-Net | 8 (10%) | 82.72 | 71.73 | 13.33 | 3.27 |
| V-Net | 16 (20%) | 86.10 | 76.09 | 13.29 | 3.19 |
| V-Net | 80 (All) | 91.47 | 84.35 | 5.49 | 1.52 |
| V-Net+SA | 8 (10%) | 85.06 | 75.03 | 14.35 | 3.91 |
| V-Net+SA | 16 (20%) | 90.23 | 81.76 | 7.66 | 2.54 |
| V-Net+PSA | 8 (10%) | 90.56 | 81.47 | 7.98 | 2.62 |
| V-Net+PSA | 16 (20%) | 91.16 | 83.34 | 5.88 | 1.89 |
Table 4. Comparative experiments validating the effectiveness of triple decoder joint training.

| Method | Labeled | Dice ↑ | Jaccard ↑ | HD95 ↓ (Voxel) | ASD ↓ (Voxel) |
|---|---|---|---|---|---|
| V-Net | 8 (10%) | 82.72 | 71.73 | 13.33 | 3.27 |
| V2d-Net | 8 (10%) | 85.78 | 75.40 | 14.45 | 3.84 |
| V3-Net | 8 (10%) | 85.56 | 76.69 | 13.41 | 3.93 |
| V3d-Net | 8 (10%) | 87.55 | 78.24 | 9.95 | 2.18 |
| V2d-Net | 16 (20%) | 88.96 | 80.37 | 7.63 | 2.33 |
| V3-Net | 16 (20%) | 86.64 | 79.69 | 12.07 | 2.75 |
| V3d-Net | 16 (20%) | 90.79 | 83.21 | 5.79 | 1.97 |
Table 5. Comparison with four SOTA methods on the pancreatic CT database.

| Method | Labeled | Dice ↑ | Jaccard ↑ | HD95 ↓ (Voxel) | ASD ↓ (Voxel) |
|---|---|---|---|---|---|
| V-Net | 6 (10%) | 55.06 | 40.48 | 32.80 | 12.67 |
| V-Net | 12 (20%) | 69.65 | 55.18 | 20.19 | 6.31 |
| V-Net | 62 (All) | 83.01 | 71.35 | 5.18 | 1.19 |
| UA-MT [40] | 6 (10%) | 68.70 | 54.65 | 13.89 | 3.23 |
| SASSNet [47] | 6 (10%) | 66.52 | 52.23 | 17.11 | 2.25 |
| DTC [7] | 6 (10%) | 66.27 | 52.07 | 15.00 | 4.43 |
| MC-Net [12] | 6 (10%) | 68.94 | 54.74 | 16.28 | 3.16 |
| BEAP-Net (Ours) | 6 (10%) | 68.99 ↑ 0.05 | 54.86 ↑ 0.12 | 14.02 ↑ 0.13 | 2.09 ↓ 0.16 |
| UA-MT [40] | 12 (20%) | 76.77 | 63.77 | 11.41 | 2.79 |
| SASSNet [47] | 12 (20%) | 77.12 | 64.24 | 8.93 | 1.41 |
| DTC [7] | 12 (20%) | 78.27 | 64.75 | 8.37 | 2.27 |
| MC-Net [12] | 12 (20%) | 79.05 | 65.82 | 10.29 | 2.71 |
| BEAP-Net (Ours) | 12 (20%) | 79.11 ↑ 0.06 | 65.94 ↑ 0.12 | 8.50 ↑ 0.13 | 1.57 ↓ 0.16 |
All experiments were conducted in the same environment. Pancreas-CT: https://wiki.cancerimagingarchive.net/display/Public/Pancreas-CT (accessed on 2 July 2024).
Table 6. Comparative experiments of the three methods.

| Method | Labeled | Dice ↑ | Jaccard ↑ | HD95 ↓ (Voxel) | ASD ↓ (Voxel) |
|---|---|---|---|---|---|
| V-Net | 80 (All) | 91.47 | 84.36 | 5.48 | 1.51 |
| BEAP-Net | 8 (10%) | 88.95 | 80.18 | 8.90 | 1.94 |
| try1 | 8 (10%) | 87.56 | 78.69 | 10.41 | 2.18 |
| try2 | 8 (10%) | 85.55 | 76.24 | 12.95 | 3.93 |
| BEAP-Net | 16 (20%) | 91.52 | 84.41 | 5.08 | 1.39 |
| try1 | 16 (20%) | 91.34 | 84.40 | 5.37 | 1.48 |
| try2 | 16 (20%) | 91.27 | 83.99 | 5.75 | 1.73 |