Article

PE-MT: A Perturbation-Enhanced Mean Teacher for Semi-Supervised Image Segmentation

1 Wenzhou Third Clinical Institute Affiliated to Wenzhou Medical University, The Third Affiliated Hospital of Shanghai University, Wenzhou People’s Hospital, Wenzhou 325041, China
2 Ningbo Key Laboratory of Medical Research on Blinding Eye Diseases, Ningbo Eye Institute, Ningbo Eye Hospital, Wenzhou Medical University, Ningbo 315040, China
3 The Business School, The University of Sydney, Sydney 2006, Australia
4 National Engineering Research Center of Ophthalmology and Optometry, Eye Hospital, Wenzhou Medical University, Wenzhou 325027, China
5 School of Biomedical Engineering and Imaging Sciences, King’s College London, London WC2R 2LS, UK
6 National Clinical Research Center for Ocular Diseases, Eye Hospital, Wenzhou Medical University, Wenzhou 325027, China
* Authors to whom correspondence should be addressed.
These authors contributed equally as first authors.
Bioengineering 2025, 12(5), 453; https://doi.org/10.3390/bioengineering12050453
Submission received: 18 March 2025 / Revised: 16 April 2025 / Accepted: 22 April 2025 / Published: 25 April 2025
(This article belongs to the Section Biosignal Processing)

Abstract

The accurate segmentation of medical images is of great importance in many clinical applications and is generally achieved by training deep learning networks on a large number of labeled images. However, obtaining enough labeled images is difficult and costly. In this paper, we develop a novel semi-supervised segmentation method (called PE-MT) based on the uncertainty-aware mean teacher (UA-MT) framework by introducing a perturbation-enhanced exponential moving average (pEMA) and a residual-guided uncertainty map (RUM) to enhance the performance of the student and teacher models. The former alleviates the coupling effect between the student and teacher models in the UA-MT by adding different weight perturbations to them, while the latter accurately locates image regions with high uncertainty via a unique quantitative formula and then effectively highlights these regions during image segmentation. We evaluated the developed method by extracting four different cardiac regions from the public LASC and ACDC datasets. The experimental results showed that our developed method achieved an average Dice similarity coefficient (DSC) of 0.6252 and 0.7836 for the four object regions when trained on 5% and 10% labeled images, respectively. It outperformed the UA-MT and was competitive with several existing semi-supervised learning methods (e.g., SASSNet and DTC).

1. Introduction

The accurate segmentation of medical images, such as computed tomography and magnetic resonance imaging, is pivotal for clinical applications ranging from disease diagnosis to treatment planning [1,2,3]. Such segmentation can assist with the detection of regions of interest (ROIs) in medical images and the assessment of the morphological characteristics of the regions. A large number of image segmentation methods have been proposed in the last decade [4,5], and most of them are based on deep learning technology [6]. These deep learning-based methods have achieved remarkable segmentation performance in fully supervised settings, but their performance heavily relies on large-scale labeled image data. Manual pixel-level labeling, however, is labor-intensive and time-consuming, especially for medical images where expert knowledge is required. This greatly restricts the applications of these deep learning-based segmentation methods [7,8,9]. To alleviate the scarcity of labeled data, different deep learning technologies have been proposed to take advantage of unlabeled image data, such as self-supervised learning [10,11], semi-supervised learning [12,13,14], and weakly supervised learning [15,16]. Among these technologies, semi-supervised learning (SSL) has emerged as a promising paradigm, leveraging both labeled and unlabeled data to enhance model generalization.
Many SSL methods have been proposed to fully integrate a small amount of labeled data and a large amount of unlabeled data for accurate image segmentation. For example, Tarvainen et al. [14] proposed the classical mean teacher (MT) method, which uses an exponential moving average (EMA) scheme to align a student model and a teacher model that are enforced to produce consistent predictions for unlabeled images. Wang et al. [17] improved the MT by introducing a unique model-level residual perturbation and an exponential Dice (eDice) loss. Yu et al. [18] proposed an uncertainty-aware mean teacher (UA-MT) method that uses entropy uncertainty maps to filter out unreliable boundary predictions made by the teacher model. Adiga et al. [19] improved the UA-MT by using a pre-trained denoising auto-encoder (DAE) to generate uncertainty maps and reduce the computational overhead. Li et al. [20] developed a multi-task deep learning network and introduced an adversarial loss between the predicted signed distance maps (SDMs) of labeled and unlabeled data. Luo et al. [21] proposed a dual-task consistency semi-supervised method by explicitly establishing task-level regularization. Shi et al. [22] utilized different decoders to generate certain and uncertain object regions and helped a student network learn from them with different network weights. These semi-supervised segmentation methods have the potential to handle various medical images and find promising applications, but they may suffer from relatively large segmentation errors (especially in object boundary regions). This is probably because (1) the EMA can lead to a tight coupling between the network weights of the student and teacher models, causing the two models to produce very similar predictions for unlabeled images and thus suppressing the student model's potential to learn from the teacher's predictions, and (2) the boundary regions of target objects are not effectively processed by the student and teacher models or by the existing uncertainty strategies in these semi-supervised methods, thus leading to relatively large segmentation errors.
In this paper, we develop a novel semi-supervised learning method (called PE-MT) for accurate image segmentation based on the UA-MT by introducing a perturbation-enhanced EMA (pEMA) and a residual-guided uncertainty map (RUM) to overcome the drawbacks of the traditional EMA and entropy uncertainty map (EUM). The pEMA provides proper network weights for both the student and teacher models and alleviates the coupling effect between them via the modulus operator, while the RUM highlights unreliable predictions in the boundary regions of target objects by leveraging a unique quantitative uncertainty formula and forces the student model to focus on the remaining regions. With these two components, our developed method is expected to handle medical images of varying modalities and obtain promising segmentation performance compared with the UA-MT and several other semi-supervised methods.

2. Method

2.1. Scheme Overview

Figure 1 shows the developed semi-supervised segmentation method, which introduces the pEMA and RUM to improve the learning potential of the teacher and student models in the available UA-MT. The two models share the same network backbone (e.g., U-Net or V-Net), but their network weights are updated through distinct mechanisms. Specifically, the teacher's weights are obtained from the student's weights at different training steps through the pEMA, which not only enables the teacher model to capture the information learned by the student but also reduces the coupling between the teacher and student models. With the obtained weights, the teacher model generates a prediction for each unlabeled image. These predictions are then filtered by the RUM to remove unreliable regions and used as pseudo-labels for the unlabeled images. With these pseudo-labels, the student model can extract a large number of discriminative features from a small number of labeled images and a large number of unlabeled images for segmentation purposes, leveraging the supervised and unsupervised losses. Minimizing these two losses enables the student and teacher models to achieve very similar segmentation performance.

2.2. Semi-Supervised Segmentation

To minimize the supervised and unsupervised losses, we trained the developed semi-supervised method on a training set consisting of N labeled images and M unlabeled images. The labeled and unlabeled images can be represented by S_l = {(x_i, y_i)}_{i=1}^{N} and S_u = {x_i}_{i=1}^{M}, respectively, where x_i, y_i ∈ ℝ^{H×W×D} denote the involved image and its label (i.e., ground truth) with height H, width W, and depth D. With these images and labels, the total loss function of our developed method can be defined as follows:
L_{\mathrm{total}} = \sum_{i=1}^{N} L_{s}\left(p_{i}^{s}(x_{i},\theta),\, y_{i}\right) + \lambda \sum_{i=1}^{N+M} L_{u}\left(p_{i}^{s}(x_{i},\theta,\eta_{1}),\, p_{i}^{t}(x_{i},\theta',\eta_{2})\right)
L_{s} = \mathrm{CE}\left(p_{i}^{s}(x_{i},\theta), y_{i}\right) + \mathrm{Dice}\left(p_{i}^{s}(x_{i},\theta), y_{i}\right) = -\frac{1}{|\Omega|}\sum_{\Omega} y_{i}\log p_{i}^{s}(x_{i},\theta) + 1 - \frac{2\sum_{\Omega} p_{i}^{s}(x_{i},\theta)\, y_{i}}{\sum_{\Omega} p_{i}^{s}(x_{i},\theta) + \sum_{\Omega} y_{i}}
L_{u} = \frac{1}{|\Omega|}\sum_{\Omega}\left| p_{i}^{s}(x_{i},\theta,\eta_{1}) - p_{i}^{t}(x_{i},\theta',\eta_{2})\right|^{2}
where θ and θ′ denote the network weights of the student and teacher models, and η_1 and η_2 denote small random noises added to labeled and unlabeled images, respectively. p_i^s(x_i, θ), p_i^s(x_i, θ, η_1), and p_i^t(x_i, θ′, η_2) indicate the predictions of image x_i obtained by the student and teacher models under the small random noises η_1 and η_2. L_s is the supervised loss, comprising the cross-entropy (CE) and Dice loss functions [23], and |Ω| denotes the number of pixels in the image domain Ω. L_u is the unsupervised loss, which assesses the consistency between the predictions p_i^s(x_i, θ, η_1) and p_i^t(x_i, θ′, η_2) using the pixel-wise mean-squared error (MSE). λ is a scalar factor that balances L_s and L_u and is often set to λ_t = 0.1·e^{−5(1 − t/t_max)²} according to previous studies [14,17], where t and t_max denote the current and maximum iteration numbers during network training, respectively. For simplicity, the predictions of image x_i obtained by the student and teacher models are denoted by p_i^s and p_i^t, respectively.
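The three terms above can be assembled as in the following PyTorch sketch. It is a minimal illustration under the stated definitions rather than the authors' released code: the softmax probability maps p_s and p_t are assumed to have shape (B, C, ...) with one-hot labels y of the same shape, and the function names (supervised_loss, consistency_loss, ramp_up_weight) are ours.

```python
# Minimal PyTorch sketch of the supervised loss L_s, the consistency loss L_u, and the
# ramp-up factor lambda_t defined above. Softmax probability maps p_s, p_t of shape
# (B, C, ...) and one-hot labels y of the same shape are assumed; names are illustrative.
import math
import torch
import torch.nn.functional as F

def supervised_loss(p_s: torch.Tensor, y: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Cross entropy plus Dice loss on labeled images (L_s)."""
    ce = -(y * torch.log(p_s.clamp_min(eps))).sum(dim=1).mean()
    spatial = tuple(range(2, p_s.ndim))                     # all spatial dimensions
    inter = (p_s * y).sum(dim=spatial)
    dice = 1.0 - (2.0 * inter + eps) / (p_s.sum(dim=spatial) + y.sum(dim=spatial) + eps)
    return ce + dice.mean()

def consistency_loss(p_s: torch.Tensor, p_t: torch.Tensor) -> torch.Tensor:
    """Pixel-wise mean-squared error between student and teacher predictions (L_u)."""
    return F.mse_loss(p_s, p_t.detach())

def ramp_up_weight(t: int, t_max: int, w_max: float = 0.1) -> float:
    """Gaussian ramp-up lambda_t = 0.1 * exp(-5 * (1 - t / t_max)^2)."""
    return w_max * math.exp(-5.0 * (1.0 - min(t, t_max) / t_max) ** 2)
```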

2.3. The pEMA

The pEMA was derived from the EMA and used to provide a small weight perturbation for the student model so that it could obtain better accuracy and generalization capability in image segmentation. The EMA and pEMA can be separately given by
\theta'_{t} = \alpha\,\theta'_{t-1} + (1 - \alpha)\,\theta_{t}
\theta'_{t} = \alpha\,\theta'_{t-1} + (1 - \alpha)\,\theta_{t}, \qquad \theta_{t} \leftarrow \theta_{t} + \beta\,\mathrm{mod}\left(\theta'_{t}, \theta_{t}\right)
where θ′_t denotes the network weights of the teacher model obtained by the EMA from the student's weights θ_t at training step t, α and β are two scalar factors, and mod is the element-wise modulus operator. From the two formulas, it can be seen that in the original EMA, the student's weights were learned from a small number of labeled images, and the teacher's weights were merely derived from the student's weights at different training steps. This calculation scheme makes the teacher's weights very similar to the student's, thus limiting the efficient utilization of unlabeled images. Conversely, in the pEMA, the student model is updated based not only on the labeled data at the current training step but also on a residual perturbation between the student and teacher weights obtained via the modulus operator. This can, to some extent, make the two models have different network weights and thus alleviate the coupling effect between them. Moreover, the residual perturbation is closely associated with both the student and teacher weights and changes adaptively as the network is trained, which gives the pEMA the potential to improve the segmentation performance of the two models.
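A minimal sketch of the two update rules is given below. It assumes two PyTorch modules with identical architectures; torch.fmod stands in for the element-wise modulus operator mod(·,·), which is our assumption about its numerical realization, and the function names are ours.

```python
# Sketch of the EMA and pEMA weight updates described above. `student` and `teacher`
# share the same architecture; torch.fmod is used as the element-wise modulus, which
# is an assumption about how mod(., .) is realized.
import torch

@torch.no_grad()
def ema_update(student: torch.nn.Module, teacher: torch.nn.Module, alpha: float) -> None:
    """Classical EMA: theta'_t = alpha * theta'_{t-1} + (1 - alpha) * theta_t."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(alpha).add_(s_p, alpha=1.0 - alpha)

@torch.no_grad()
def pema_update(student: torch.nn.Module, teacher: torch.nn.Module,
                alpha: float, beta: float = 0.001) -> None:
    """pEMA: EMA update of the teacher, followed by a residual modulus perturbation
    beta * mod(theta'_t, theta_t) added to the student's weights."""
    ema_update(student, teacher, alpha)
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        s_p.add_(torch.fmod(t_p, s_p), alpha=beta)
```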

2.4. The RUM

The RUM was constructed based on multiple forward passes [24] of the teacher model under random image-level perturbation (e.g., dropout and noise) to show its prediction reliability for desirable objects depicted on unlabeled images. It can be given by
\mathrm{RUM}_{\upsilon} = \sum_{c=1}^{C} \bar{p}_{c}\left(1 - \bar{p}_{c}\right)^{\upsilon\,\bar{p}_{c}}
\bar{p}_{c} = \frac{1}{K}\sum_{k=1}^{K} p_{k,c}^{t}
where p_{k,c}^t denotes the prediction of the k-th forward pass of the teacher model for class c in the unlabeled image x, and K and C are the total numbers of forward passes and classes, respectively. υ is a scalar coefficient used to adjust the contribution of the mean prediction probability p̄_c in the RUM. With this quantitative formula, our uncertainty map has a better capability to locate image regions with high uncertainty (especially the boundary regions of target objects) and to highlight the prediction unreliability in these regions, compared with the original entropy uncertainty map (EUM) in the UA-MT, which has been widely used in previous studies and is defined as follows:
\mathrm{EUM} = -\sum_{c=1}^{C} \bar{p}_{c}\log\bar{p}_{c}
Figure 2 illustrates the differences between the RUM and EUM based on the prediction probability p of a pixel for a segmentation task with two classes (i.e., p for the desirable region and (1 − p) for the background). According to Equations (7) and (8), the RUM and EUM have similar quantization curves and reach their maximum uncertainty value at a probability of 0.5, since a probability of 0.5 is often used to decide whether a pixel belongs to the object region in deep learning. However, the RUM has a larger maximum at a probability of 0.5 and a steeper curve, suggesting that the RUM can more quickly and accurately locate uncertain prediction regions, which are then excluded from the unsupervised loss.
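For illustration, both maps can be estimated from K stochastic forward passes of the teacher, as sketched below. This is a minimal sketch: the Gaussian input noise and active dropout stand in for the random perturbations, and the exact placement of υ in the RUM expression follows our reading of the formula above and should be treated as an assumption.

```python
# Sketch of estimating the mean teacher prediction and the two uncertainty maps from
# K stochastic forward passes (dropout kept active, small Gaussian input noise). The
# RUM expression follows the form given above; its exact parameterization by upsilon
# is an assumption. The EUM is the usual predictive entropy.
import torch

@torch.no_grad()
def mean_prediction(teacher: torch.nn.Module, x: torch.Tensor, K: int = 8) -> torch.Tensor:
    """Average softmax prediction over K noisy forward passes, shape (B, C, ...)."""
    teacher.train()  # keep dropout active during the stochastic passes
    probs = torch.stack([torch.softmax(teacher(x + 0.1 * torch.randn_like(x)), dim=1)
                         for _ in range(K)], dim=0)
    return probs.mean(dim=0)

def eum(p_bar: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Entropy uncertainty map: -sum_c p_bar_c * log(p_bar_c), shape (B, ...)."""
    return -(p_bar * torch.log(p_bar.clamp_min(eps))).sum(dim=1)

def rum(p_bar: torch.Tensor, upsilon: float = 2.0) -> torch.Tensor:
    """Residual-guided uncertainty map: sum_c p_bar_c * (1 - p_bar_c)^(upsilon * p_bar_c)."""
    return (p_bar * (1.0 - p_bar).clamp_min(0.0).pow(upsilon * p_bar)).sum(dim=1)
```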
Based on the introduced RUM, we can enhance the consistency between the predictions of the student and teacher models for unlabeled images by filtering out image regions with high uncertainty in the unsupervised loss:
L_{u} = \frac{\sum_{\Omega} \mathbb{I}\left(\mathrm{RUM}_{\upsilon} < \tau\right)\left| p^{s} - p^{t}\right|^{2}}{\sum_{\Omega} \mathbb{I}\left(\mathrm{RUM}_{\upsilon} < \tau\right)}
where τ is a given uncertainty threshold, assigned to (0.75 + 0.25·e^{−5(1 − t/t_max)²})·log(2) in the UA-MT, and I(·) is the indicator function.
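A sketch of this masked consistency loss follows; the helper names mirror the earlier sketches, and the natural logarithm in the threshold schedule is an assumption carried over from the UA-MT.

```python
# Sketch of the uncertainty-masked consistency loss: pixels whose RUM value exceeds
# the threshold tau are excluded from the student-teacher MSE. Helper names follow
# the earlier sketches and are illustrative.
import math
import torch

def uncertainty_threshold(t: int, t_max: int, u_max: float = math.log(2.0)) -> float:
    """Ramp-up threshold tau = (0.75 + 0.25 * exp(-5 * (1 - t / t_max)^2)) * log(2)."""
    return (0.75 + 0.25 * math.exp(-5.0 * (1.0 - min(t, t_max) / t_max) ** 2)) * u_max

def masked_consistency_loss(p_s: torch.Tensor, p_t: torch.Tensor,
                            uncertainty: torch.Tensor, tau: float,
                            eps: float = 1e-6) -> torch.Tensor:
    """MSE between student and teacher predictions over reliable pixels only."""
    mask = (uncertainty < tau).float().unsqueeze(1)   # indicator I(RUM < tau), shape (B, 1, ...)
    sq_err = (p_s - p_t.detach()).pow(2)              # squared error, shape (B, C, ...)
    return (mask * sq_err).sum() / (mask.sum() * p_s.shape[1] + eps)
```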

3. Experiments and Results

3.1. Dataset and Evaluation Metrics

In this study, we used the Left Atrial (LA) Segmentation Challenge (LASC) dataset [25] and the Automated Cardiac Diagnosis Challenge (ACDC) dataset [26] to validate the developed method. The LASC dataset consists of 100 3D gadolinium-enhanced MRI scans (GE-MRIs) and their corresponding segmentation labels, both with an isotropic resolution of 0.625 × 0.625 × 0.625 mm³. These GE-MRIs were normalized to zero mean and unit variance and divided into 80 scans for network training and 20 scans for performance validation, following previous studies [18]. The ACDC dataset contains end-diastolic and end-systolic short-axis cardiac cine-MRI scans of 100 patients and their corresponding segmentation masks for three different tissue regions, namely the left ventricle (LV), myocardium (Myo), and right ventricle (RV). These data were divided into 70 and 30 patients’ scans for network training and validation, respectively. Because of the large spacing between short-axis slices and the possible inter-slice shift caused by respiratory motion, we used U-Net to segment each slice separately, as recommended by previous studies [27]. Figure 3 illustrates images and their corresponding labels from the LASC and ACDC datasets.
We used the available V-Net [8,18] and U-Net [7] as backbone networks for LA and cardiac segmentation, respectively, and assessed their performance [28] using the Dice similarity coefficient (DSC), Jaccard coefficient (JAC), 95% Hausdorff Distance (HD), and average surface distance (ASD), all of which are available in the MedPy library (https://github.com/loli/medpy) (accessed on 27 November 2024) and defined as follows:
\mathrm{DSC} = \frac{2\sum_{\Omega} p\,y}{\sum_{\Omega} p + \sum_{\Omega} y}
\mathrm{JAC} = \frac{\sum_{\Omega} p\,y}{\sum_{\Omega} p + \sum_{\Omega} y - \sum_{\Omega} p\,y}
\mathrm{HD} = \max\left\{ hd(p, y),\; hd(y, p) \right\}
\mathrm{ASD} = \frac{1}{|S(p)|}\sum_{a \in S(p)} \min_{b \in S(y)} \left\| a - b \right\|
where p and y denote the prediction for a given image and its corresponding label, respectively, S(·) is the set of surface voxels/pixels of a segmentation, ‖a − b‖ is the distance from point a to point b, and hd(p, y) = max_{b∈S(y)} min_{a∈S(p)} ‖a − b‖ is the directed HD from p to y. The DSC and JAC metrics range from 0 to 1, where higher values denote better segmentation accuracy. Conversely, the HD and ASD are distance-based metrics (measured in pixels) bounded below by 0, where lower values correspond to smaller segmentation errors.
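For reference, the four metrics can be computed for a single scan with the MedPy routines mentioned above, as sketched below; binary (0/1) NumPy masks of identical shape are assumed, and the voxel spacing argument is optional.

```python
# Computing DSC, JAC, 95% HD, and ASD for one binary prediction/label pair with MedPy.
# Inputs are assumed to be 0/1 NumPy arrays of the same shape; `spacing` (in mm) is optional.
import numpy as np
from medpy.metric.binary import dc, jc, hd95, asd

def evaluate_segmentation(pred: np.ndarray, label: np.ndarray, spacing=None) -> dict:
    """Return the four evaluation metrics used in this study for one scan."""
    return {
        "DSC": dc(pred, label),                          # Dice similarity coefficient
        "JAC": jc(pred, label),                          # Jaccard coefficient
        "HD":  hd95(pred, label, voxelspacing=spacing),  # 95th-percentile Hausdorff distance
        "ASD": asd(pred, label, voxelspacing=spacing),   # average surface distance
    }
```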

3.2. Implementation Details

We implemented the developed method via PyTorch (version 1.9.1) on a platform with an NVIDIA GeForce RTX 2080 SUPER GPU for the two segmentation tasks, based on the public code available at https://github.com/HiLab-git/SSL4MIS (accessed on 12 September 2024), and trained it three times with a fixed random seed, without any pretrained weights. During training, the network parameters were updated by the Stochastic Gradient Descent (SGD) optimizer with an initial learning rate of 0.01 and a maximum iteration number of 6000. The learning rate was decayed by a factor of 0.1 every 2500 iterations. The batch sizes were set to 4 and 24 for the LASC and ACDC datasets, respectively, with equal numbers of labeled and unlabeled images in each batch. The parameters β and υ were set to 0.001 and 2, respectively. Other parameters were set as follows: α = min(1 − 1/t, 0.99), K = 8, and C = 2, following previous configurations of the UA-MT. During training, we randomly cropped the LA region from the LASC dataset with dimensions of 112 × 112 × 80 voxels, resized the ACDC slices to 256 × 256 pixels, and randomly augmented these images (e.g., rotation and flip) to avoid over-fitting. In addition, we compared the developed method with four semi-supervised segmentation methods (i.e., MT [14], UA-MT [18], SASSNet [20], and DTC [21]) to demonstrate its effectiveness and accuracy. For a fair comparison, the involved methods were based on the same backbones (i.e., V-Net and U-Net) and trained on two different proportions of labeled and unlabeled images from the training sets (n = 80 and 70) of the LASC and ACDC datasets to demonstrate their segmentation performance and reliance on labeled data. Specifically, 5% (10%) of the images in the training sets were used as labeled data and 95% (90%) were used as unlabeled data for network training. After training, the performances were independently assessed on the validation sets of the LASC and ACDC datasets, following previous studies [14,17,18].
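Putting the pieces together, one training step under the stated hyper-parameters might look as follows. This is a condensed sketch: data loading, model construction, cropping, and augmentation are omitted, the helper functions are those sketched in Section 2, and the SGD momentum and weight-decay values are typical choices rather than settings reported here.

```python
# Condensed training-step sketch combining the losses, the pEMA update, and the
# uncertainty masking described earlier, using the stated hyper-parameters. The
# labeled loader is assumed to yield (image, one-hot label) pairs and the unlabeled
# loader plain images; momentum and weight decay are typical values, not reported ones.
import torch

def train(student, teacher, labeled_loader, unlabeled_loader, t_max: int = 6000):
    optimizer = torch.optim.SGD(student.parameters(), lr=0.01,
                                momentum=0.9, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2500, gamma=0.1)
    for t, ((x_l, y_l), x_u) in enumerate(zip(labeled_loader, unlabeled_loader), start=1):
        if t > t_max:
            break
        x_all = torch.cat([x_l, x_u], dim=0)
        # student forward pass under small input noise eta_1
        p_s = torch.softmax(student(x_all + 0.1 * torch.randn_like(x_all)), dim=1)
        # mean teacher prediction over K = 8 stochastic passes and its uncertainty
        p_bar = mean_prediction(teacher, x_all, K=8)
        tau = uncertainty_threshold(t, t_max)
        loss = supervised_loss(p_s[: x_l.shape[0]], y_l) \
             + ramp_up_weight(t, t_max) * masked_consistency_loss(p_s, p_bar,
                                                                  rum(p_bar, 2.0), tau)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()
        # pEMA: EMA teacher update plus the beta-weighted modulus perturbation
        pema_update(student, teacher, alpha=min(1.0 - 1.0 / t, 0.99), beta=0.001)
```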

3.3. Segmentation of the LASC Dataset

Table 1 presents the results of the involved semi-supervised methods for LA segmentation, based on the V-Net backbone and the validation set of the LASC dataset. It can be seen from the results that (1) our developed method obtained an average DSC of 0.8341 and 0.8729 when trained on 5% and 10% labeled data, respectively, outperforming the MT (0.7916 and 0.8631), UA-MT (0.8080 and 0.8648), SASSNet (0.8137 and 0.8623), and DTC (0.8067 and 0.8679) based on the same backbone and experimental dataset. This showed the advantages of the developed method over the other four semi-supervised methods in image segmentation. (2) All the involved methods performed better than V-Net (0.5043 and 0.7610), which was trained solely on the involved labeled images in a fully supervised manner, suggesting the importance of unlabeled images in the semi-supervised learning framework. (3) These semi-supervised methods showed improved segmentation performance when trained on more labeled images and gradually approached the performance of V-Net trained on all the labeled images in a fully supervised manner. Figure 4 illustrates the segmentation results of the involved methods for four different images from the LASC dataset.

3.4. Segmentation of the ACDC Dataset

Table 2 shows the results of our developed method based on the U-Net backbone and the validation set for segmenting the RV, Myo, and LV regions from the ACDC dataset in the first experiment. As demonstrated by the results, our developed method showed improved performance in the semi-supervised segmentation framework when trained on more labeled images and could compete with the U-Net in the fully supervised framework. Specifically, our developed method achieved an average DSC of 0.4166, 0.5635, and 0.6864 for the RV, Myo, and LV, respectively, when trained on 5% labeled data, and 0.6199, 0.7932, and 0.8482 for the three regions when trained on 10% labeled data. On average, it was superior to U-Net for the three object regions when using only 5% and 10% labeled data for network training, as shown in Table 2.
Table 3 summarizes the average segmentation results of the involved semi-supervised methods for three different experiments based on the U-Net backbone and ACDC dataset. As shown by the results, these semi-supervised methods had an improved segmentation performance on the validation set of the ACDC dataset when using more labeled images for network training, and they gradually approached the fully supervised results of U-Net trained on all the labeled images. However, they had very different capabilities in extracting three object regions from the ACDC dataset. Specifically, our developed method had an average DSC of 0.5555 and 0.7538 for three different regions (i.e., the LV, Myo, and RV) when trained on 5% and 10% labeled data, respectively. It was superior to the MT (0.5457 and 0.7483) and UA-MT (0.5383 and 0.7385) but inferior to the DTC (0.5601 and 0.7842) and SASSNet (0.5897 and 0.8108) under the same experiment conditions. Figure 5 illustrates the segmentation results of the involved methods for four different images from the ACDC dataset.

3.5. Ablation Study

3.5.1. Effect of the pEMA and RUM

Table 4 summarizes the impact of the pEMA and RUM on the performance of the UA-MT for the two segmentation tasks, obtained by using the two components to replace the EMA and EUM (note that the UA-MT can be viewed as a combination of the EMA, EUM, student, and teacher models, while PE-MT is a variant of the UA-MT created by introducing the pEMA and RUM). It can be seen that the UA-MT achieved consistently higher average performance for the different object regions depicted on the LASC and ACDC datasets when the pEMA and RUM replaced their corresponding original versions (i.e., the EMA and EUM). This suggests the effectiveness of the introduced pEMA and RUM compared with the EMA and EUM. Figure 6 shows the difference between the RUM and EUM in semi-supervised image segmentation. It can be seen that our introduced RUM can effectively identify and highlight unreliable prediction regions and suppress the adverse impact of background information far away from desirable objects, while the EUM flagged many background regions, especially those close to object boundary regions.

3.5.2. The Parameters υ and β

Table 5 and Table 6 separately show the impact of the parameters υ and β in the RUM and pEMA on the performance of the developed method in two specific segmentation experiments based on the LASC and ACDC datasets. As demonstrated by these results, our developed method achieved better overall performance when the parameter υ was set to 2 for both the LASC and ACDC datasets. With υ fixed at 2, our developed method obtained higher accuracy when the parameter β was set to 0.001 for the same segmentation tasks, as shown in Table 6.

4. Discussion

In this paper, we proposed a novel semi-supervised learning method (PE-MT) based on the UA-MT and validated it by extracting multiple cardiac regions from the public LASC and ACDC datasets. The experimental results showed that our developed method can effectively extract desirable object regions by leveraging two available network backbones (i.e., V-Net and U-Net), and it obtained promising segmentation accuracy, owing to the introduction of the pEMA and RUM, when trained on 5% (10%) labeled images and 95% (90%) unlabeled ones from the training sets. It was superior to the MT and UA-MT and could compete with the SASSNet and DTC when trained on the same numbers of labeled and unlabeled images from the LASC and ACDC datasets. Moreover, our method tended to achieve higher segmentation accuracy when trained on more labeled and unlabeled images and was able to rapidly process an unseen image at the inference stage (around 1 s).
Our developed method was derived from the UA-MT and superior to it under the same experimental conditions. This was mainly attributed to the introduction of the RUM and pEMA. The RUM had a reasonable capability to accurately identify regions with high uncertainty in the prediction maps of unlabeled images obtained by the teacher model. By eliminating these high-uncertainty prediction regions, the student and teacher models were able to emphasize reliable prediction regions in the calculation of the unsupervised loss and thus improve the prediction accuracy and consistency of the two models for unlabeled images. This largely enhanced the segmentation potential of the two models and excluded the impact of irrelevant information on the final performance. Moreover, the performance could be further enhanced by the introduced pEMA, since it was able not only to provide proper network weights for the teacher model but also to increase the learning flexibility of the student model by adding a weight perturbation that suppresses the coupling effect between the two models. This learning flexibility can, to some extent, facilitate the detection of various object features and increase the utilization efficiency of label information.
Despite its promising performance, our developed method was inferior to the SASSNet and DTC when extracting the three object regions from the ACDC dataset. This may be due to the fact that (1) our developed method merely employed the V-Net and U-Net to segment desirable objects and did not involve additional network branches or auxiliary learning tasks in image segmentation. In contrast, both the SASSNet and DTC used multiple network branches to simultaneously extract desirable objects and their corresponding signed distance maps in a mutually collaborative manner. This can enhance the learning procedure of specific neural networks owing to the extra network parameters and auxiliary processing tasks and hence improve image segmentation accuracy. (2) V-Net and U-Net had limited learning capability and network parameters (see their structures at https://github.com/HiLab-git/SSL4MIS (accessed on 12 September 2024)) and could not capture enough convolutional features for segmentation purposes when trained on a very small number of labeled images (e.g., three and seven patients' scans). (3) There were far fewer labeled images than unlabeled ones when segmenting the cardiac regions from the ACDC dataset, which may have led to very large data distribution differences (or domain shifts) between the two types of images. These differences made our method subject to relatively severe performance degradation compared with the SASSNet and DTC.
Finally, there were some limitations to this study. First, our developed method was only validated with plain network backbones (i.e., V-Net and U-Net), which have relatively limited learning capability compared with other deep learning architectures such as Transformers and multi-layer perceptrons (MLPs). This can largely limit its segmentation performance and clinical application potential. Second, only a few data augmentation schemes (e.g., rotation and flip) were used in the segmentation experiments, which may reduce the accuracy of our developed method when segmenting medical images of other modalities. Third, both the LASC and ACDC datasets contain a small number of images and were further split into training and testing sets. This may have prevented our developed method from capturing the various convolutional features associated with target objects, and it thus underwent rapid performance degradation when the number of labeled images in the training set was reduced. Last but not least, our developed method was not validated for dynamic image segmentation [29], which aims to process multiple images acquired at different time instants or in videos [30,31]. This incomplete performance validation not only limits the potential applications of the developed algorithm but also restricts its popularization. Despite these limitations, our model achieved promising segmentation performance on two public image datasets and surpassed the UA-MT under the same experimental configuration.

5. Conclusions

We developed a novel semi-supervised learning method (termed PE-MT) for accurate image segmentation based on a small amount of labeled data and a large amount of unlabeled data. Its novelty lies in the introduction of the pEMA and RUM and their integration with the available UA-MT. The pEMA extends the original EMA by adding an adaptive weight perturbation to the student model to enhance its learning flexibility and effectiveness, while the RUM alleviates the drawbacks of the EUM in the UA-MT via a quantitative uncertainty formula and is used to filter out prediction regions with high uncertainty. Extensive segmentation experiments on the public LASC and ACDC datasets demonstrated that the developed method can effectively extract desirable objects when trained on a small number of labeled images and a large number of unlabeled images and outperforms the MT and UA-MT under the same experimental configuration.

Author Contributions

Conceptualization, Q.Y. and L.W.; methodology, L.W.; software, W.W. (Wenquan Wang) and Z.L.; validation, X.Z., G.J. and Y.W.; formal analysis, W.W. (Wenquan Wang) and Z.L.; investigation, G.J., B.T. and S.Y.; resources, M.H. and X.X.; data curation, W.W. (Wencan Wu) and Q.Y.; writing—original draft preparation, W.W. (Wenquan Wang) and Z.L.; writing—review and editing, Q.Y. and L.W.; visualization, G.J. and B.T.; supervision, L.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Wenzhou Municipal Science and Technology Bureau, grant number Y20240059, and the Natural Science Foundation of Zhejiang Province, grant number LQ21H040007.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available upon request from the corresponding authors.

Acknowledgments

We would like to thank the anonymous reviewers for their helpful remarks that improved this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wang, Y.; Zhou, Y.; Shen, W.; Park, S.; Fishman, E.; Yuille, A. Abdominal multi-organ segmentation with organ-attention networks and statistical fusion. Med. Image Anal. 2019, 55, 88–102. [Google Scholar] [CrossRef] [PubMed]
  2. Luo, X.; Wang, G.; Song, T.; Zhang, J.; Zhang, S. MIDeepSeg: Minimally interactive segmentation of unseen objects from medical images using deep learning. Med. Image Anal. 2021, 72, 102102. [Google Scholar] [CrossRef] [PubMed]
  3. Wang, G.; Zuluaga, M.A.; Li, W.; Pratt, R.; Patel, P.A.; Aertsen, M.; Doel, T.; David, A.L.; Deprest, J.; Ourselin, S. DeepIGeoS: A deep interactive geodesic framework for medical image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 1559–1572. [Google Scholar] [CrossRef] [PubMed]
  4. Minaee, S.; Boykov, Y.; Porikli, F.; Plaza, A.; Kehtarnavaz, N.; Terzopoulos, D. Image segmentation using deep learning: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 3523–3542. [Google Scholar] [CrossRef] [PubMed]
  5. Jiao, R.; Zhang, Y.; Ding, L.; Xue, B.; Zhang, J.; Cai, R.; Jin, C. Learning with limited annotations: A survey on deep semi-supervised learning for medical image segmentation. Comput. Biol. Med. 2024, 169, 107840. [Google Scholar] [CrossRef] [PubMed]
  6. Schmidhuber, J. Deep learning in neural networks: An overview. Neural Netw. 2015, 61, 85–117. [Google Scholar] [CrossRef] [PubMed]
  7. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar] [CrossRef]
  8. Milletari, F.; Navab, N.; Ahmadi, S. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In Proceedings of the International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; pp. 565–571. [Google Scholar] [CrossRef]
  9. Dong, B.; Wang, W.; Fan, D.; Li, J.; Fu, H.; Shao, L. Polyp-pvt: Polyp segmentation with pyramid vision transformers. arXiv 2021, arXiv:2108.06932. [Google Scholar] [CrossRef]
  10. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning, Online, 13–18 July 2020; pp. 1597–1607. [Google Scholar] [CrossRef]
  11. Grill, J.; Strub, F.; Altche, F.; Tallec, C.; Richemond, P.; Buchatskaya, E.; Doersch, C.; Pires, B.; Guo, Z.; Azar, M. Bootstrap your own latent: A new approach to self-supervised learning. Adv. Neural Inf. Process. Syst. 2020, 33, 21271–21284. [Google Scholar] [CrossRef]
  12. Laine, S.; Aila, T. Temporal Ensembling for Semi-Supervised Learning. arXiv 2016, arXiv:1610.02242. [Google Scholar] [CrossRef]
  13. Yang, L.; Zhuo, W.; Qi, L.; Shi, Y.; Gao, Y. St++: Make self-training work better for semi-supervised semantic segmentation. In Proceedings of the Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4268–4277. [Google Scholar] [CrossRef]
  14. Tarvainen, A.; Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Adv. Neural Inf. Process. Syst. 2017, 30, 1195–1204. [Google Scholar] [CrossRef]
  15. Saleh, F.; Aliakbarian, M.; Salzmann, M.; Petersson, L.; Gould, S.; Alvarez, J. Built-in foreground/background prior for weakly-supervised semantic segmentation. In Proceedings of the ECCV, Amsterdam, The Netherlands, 11–14 October 2016; pp. 413–432. [Google Scholar] [CrossRef]
  16. Yang, R.; Song, L.; Ge, Y.; Li, X. BoxSnake: Polygonal Instance Segmentation with Box Supervision. In Proceedings of the International Conference on Computer Vision (ICCV), Paris, France, 2–3 October 2023; pp. 766–776. [Google Scholar] [CrossRef]
  17. Mei, C.; Yang, X.; Zhou, M.; Zhang, S.; Chen, H.; Yang, X.; Wang, L. Semi-supervised image segmentation using a residual-driven mean teacher and an exponential Dice loss. Artif. Intell. Med. 2024, 148, 102757. [Google Scholar] [CrossRef] [PubMed]
  18. Yu, L.; Wang, S.; Li, X.; Fu, C.; Heng, P. Uncertainty-aware self-ensembling model for semi-supervised 3D left atrium segmentation. In Proceedings of the Medical Image Computing and Computer Assisted Intervention, Shenzhen, China, 13–17 October 2019; pp. 605–613. [Google Scholar] [CrossRef]
  19. Adiga, S.; Dolz, J.; Lombaert, H. Leveraging labeling representations in uncertainty-based semi-supervised segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer Assisted Intervention, Singapore, 18–22 September 2022; pp. 265–275. [Google Scholar] [CrossRef]
  20. Li, S.; Zhang, C.; He, X. Shape-aware semi-supervised 3D semantic segmentation for medical images. In Proceedings of the International Conference on Medical Image Computing and Computer Assisted Intervention, Lima, Peru, 4–8 October 2020; pp. 552–561. [Google Scholar] [CrossRef]
  21. Luo, X.; Chen, J.; Song, T.; Chen, Y.; Zhang, S. Semi-supervised medical image segmentation through dual-task consistency. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021; Volume 35, pp. 8801–8809. [Google Scholar] [CrossRef]
  22. Shi, Y.; Zhang, J.; Ling, T.; Lu, J.; Zheng, Y.; Yu, Q.; Gao, Y. Inconsistency-aware uncertainty estimation for semi-supervised medical image segmentation. IEEE Trans. Med. Imaging 2021, 41, 608–620. [Google Scholar] [CrossRef] [PubMed]
  23. Zheng, Y.; Tian, B.; Yu, S.; Yang, X.; Yu, Q.; Zhou, J.; Jiang, G.; Zheng, Q.; Pu, J.; Wang, L. Adaptive boundary-enhanced Dice loss for image segmentation. Biomed. Signal Process. Control 2025, 106, 107741. [Google Scholar] [CrossRef] [PubMed]
  24. Kendall, A.; Gal, Y. What uncertainties do we need in bayesian deep learning for computer vision? Adv. Neural Inf. Process. Syst. 2017, 30, 5580–5590. [Google Scholar] [CrossRef]
  25. Xiong, Z.; Xia, Q.; Hu, Z.; Huang, N.; Zhao, J. A global benchmark of algorithms for segmenting the left atrium from late gadolinium-enhanced cardiac magnetic resonance imaging. Med. Image Anal. 2021, 67, 101832. [Google Scholar] [CrossRef] [PubMed]
  26. Bernard, O.; Lalande, A.; Zotti, C.; Cervenansky, F.; Yang, X.; Heng, P.; Cetin, I.; Lekadir, K.; Camara, O.; Ballester, M. Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: Is the problem solved? IEEE Trans. Med. Imaging 2018, 37, 2514–2525. [Google Scholar] [CrossRef] [PubMed]
  27. Bai, W.; Oktay, O.; Sinclair, M.; Suzuki, H.; Rajchl, M.; Tarroni, G.; Glocker, B.; King, A.; Matthews, P.; Rueckert, D. Semi-supervised learning for network-based cardiac MR image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer Assisted Intervention, Quebec City, QC, Canada, 10–14 September 2017; pp. 253–260. [Google Scholar] [CrossRef]
  28. Taha, A.; Hanbury, A. Metrics for evaluating 3D medical image segmentation: Analysis, selection, and tool. BMC Med. Imaging 2015, 15, 29. [Google Scholar] [CrossRef] [PubMed]
  29. Meyer, P.; Cherstvy, A.; Seckler, H.; Hering, R.; Blaum, N.; Jeltsch, F.; Metzler, R. Directedeness, correlations, and daily cycles in springbok motion: From data via stochastic models to movement prediction. Phys. Rev. Res. 2023, 5, 043129. [Google Scholar] [CrossRef]
  30. Zheng, Q.; Li, Z.; Zhang, J.; Mei, C.; Li, G.; Wang, L. Automated segmentation of palpebral fissures from eye videography using a texture fusion neural network. Biomed. Signal Process. Control 2023, 85, 104820. [Google Scholar] [CrossRef]
  31. Zheng, Q.; Zhang, X.; Zhang, J.; Bai, F.; Huang, S.; Pu, J.; Chen, W.; Wang, L. A texture-aware U-Net for identifying incomplete blinking from eye videography. Biomed. Signal Process. Control 2022, 75, 103630. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Overview of the developed semi-supervised segmentation method based on the UA-MT by introducing the unique pEMA and RUM schemes.
Figure 2. Differences between the RUM and EUM based on the prediction probability of a voxel/pixel for all the classes, where the parameter υ was set to 1, 2, and 3, respectively.
Figure 3. Illustration of images and labels in the LASC (top row) and ACDC (bottom row) datasets, respectively, where LA, Myo, LV, and RV denote the left atrium, myocardium, and left and right ventricles, respectively.
Figure 4. Segmentation results of four given images obtained by the V-Net, MT, UA-MT, SASSNet, DTC, and PE-MT, respectively, which were trained on 10% (in the first two columns) and 5% (in the last two columns) of labeled images from the LASC dataset. The red lines represent the object boundaries of the LA labels, and the yellow arrows indicate the poor segmentation.
Figure 5. Segmentation results of four different images obtained by the U-Net, MT, UA-MT, SASSNet, DTC, and PE-MT, respectively, using 10% (in the first two columns) and 5% (in the last two columns) labeled images from the ACDC dataset. The red lines represent the ground-truth boundaries, and the yellow arrows indicate the poor segmentation.
Figure 6. From top to bottom: the labels of three given images and their corresponding uncertainty maps obtained by the RUM and EUM, respectively, where the red circles highlight irrelevant background regions.
Table 1. The LA segmentation results on the validation set in terms of the average DSC, JAC, HD, and ASD, leveraging the involved methods, which were trained on different proportions of labeled and unlabeled images from the training set of the LASC dataset.

Method | Labeled | Unlabeled | DSC | JAC | HD | ASD
V-Net | 80 | 0 | 0.9178 | 0.8485 | 4.7179 | 1.5867
V-Net | 4 | 0 | 0.5043 | 0.3972 | 36.3690 | 11.0264
MT | 4 | 76 | 0.7916 | 0.6631 | 24.8149 | 7.0991
UA-MT | 4 | 76 | 0.8080 | 0.6868 | 21.7672 | 6.5760
SASSNet | 4 | 76 | 0.8137 | 0.6924 | 27.8814 | 8.0149
DTC | 4 | 76 | 0.8067 | 0.6856 | 26.6678 | 7.5836
PE-MT | 4 | 76 | 0.8341 | 0.7225 | 18.9836 | 5.0198
V-Net | 8 | 0 | 0.7610 | 0.6527 | 26.9073 | 4.8357
MT | 8 | 72 | 0.8631 | 0.7612 | 17.9738 | 4.5731
UA-MT | 8 | 72 | 0.8648 | 0.7638 | 16.7100 | 4.3400
SASSNet | 8 | 72 | 0.8623 | 0.7612 | 13.1187 | 3.7583
DTC | 8 | 72 | 0.8679 | 0.7692 | 11.6410 | 3.3986
PE-MT | 8 | 72 | 0.8729 | 0.7758 | 13.1082 | 3.8202
Table 2. The cardiac segmentation results on the validation set in terms of the average DSC, JAC, HD, and ASD, leveraging the developed method and U-Net in the first experiment, which were trained on different proportions (i.e., 5% and 10%) of labeled and unlabeled images from the training set of the ACDC dataset.

Region | Method | Labeled | Unlabeled | DSC | JAC | HD | ASD
RV | U-Net | 3 | 0 | 0.3930 | 0.2836 | 63.1196 | 30.3970
RV | PE-MT | 3 | 67 | 0.4166 | 0.2998 | 62.2174 | 26.3911
RV | U-Net | 7 | 0 | 0.6323 | 0.5096 | 24.0267 | 8.4186
RV | PE-MT | 7 | 63 | 0.6199 | 0.4994 | 18.4767 | 6.1613
Myo | U-Net | 3 | 0 | 0.5145 | 0.3983 | 20.1485 | 6.9656
Myo | PE-MT | 3 | 67 | 0.5635 | 0.4432 | 18.5294 | 7.0502
Myo | U-Net | 7 | 0 | 0.7943 | 0.6704 | 8.6746 | 2.2788
Myo | PE-MT | 7 | 63 | 0.7932 | 0.6675 | 9.7917 | 2.9752
LV | U-Net | 3 | 0 | 0.5607 | 0.4430 | 56.9506 | 21.5382
LV | PE-MT | 3 | 67 | 0.6864 | 0.5819 | 38.3050 | 13.7716
LV | U-Net | 7 | 0 | 0.8403 | 0.7427 | 29.9437 | 8.5729
LV | PE-MT | 7 | 63 | 0.8482 | 0.7511 | 34.2763 | 9.3469
Table 3. The cardiac segmentation results on the validation set in terms of the average DSC, JAC, HD, and ASD, leveraging the involved semi-supervised methods and U-Net, which were trained on different proportions of labeled and unlabeled images from the training set of the ACDC dataset for three experiments.

Method | Labeled | Unlabeled | DSC | JAC | HD | ASD
U-Net | 70 | 0 | 0.8807 | 0.7936 | 6.4722 | 1.8963
U-Net | 3 | 0 | 0.4894 | 0.3750 | 46.7396 | 19.6336
MT | 3 | 67 | 0.5457 | 0.4333 | 43.9185 | 17.3452
UA-MT | 3 | 67 | 0.5383 | 0.4272 | 41.3736 | 16.0410
SASSNet | 3 | 67 | 0.5897 | 0.4752 | 23.3788 | 8.5670
DTC | 3 | 67 | 0.5601 | 0.4511 | 26.4061 | 11.1162
PE-MT | 3 | 67 | 0.5555 | 0.4416 | 39.6839 | 15.7376
U-Net | 7 | 0 | 0.7556 | 0.6409 | 20.8817 | 6.4234
MT | 7 | 63 | 0.7483 | 0.6340 | 20.2368 | 5.6540
UA-MT | 7 | 63 | 0.7385 | 0.6199 | 21.0633 | 5.9992
SASSNet | 7 | 63 | 0.8108 | 0.7074 | 12.3803 | 3.6314
DTC | 7 | 63 | 0.7842 | 0.6842 | 10.1061 | 3.0190
PE-MT | 7 | 63 | 0.7538 | 0.6393 | 20.8482 | 6.1611
Table 4. Performance of the UA-MT trained on 10% labeled data and 90% unlabeled data from the training sets of the LASC and ACDC datasets when using the pEMA and RUM to replace the EMA and EUM, respectively.

Dataset | Method | Labeled | Unlabeled | DSC | JAC | HD | ASD
LASC | UA-MT | 8 | 72 | 0.8648 | 0.7638 | 16.7100 | 4.3400
LASC | UA-MT + RUM | 8 | 72 | 0.8724 | 0.7753 | 14.4020 | 3.7612
LASC | UA-MT + RUM + pEMA | 8 | 72 | 0.8729 | 0.7758 | 13.1082 | 3.8202
ACDC | UA-MT | 7 | 63 | 0.7385 | 0.6199 | 21.0633 | 5.9992
ACDC | UA-MT + RUM | 7 | 63 | 0.7429 | 0.6237 | 25.2195 | 7.3287
ACDC | UA-MT + RUM + pEMA | 7 | 63 | 0.7538 | 0.6393 | 20.8482 | 6.1611
Table 5. Performance of the PE-MT when setting different values for the parameter υ in the RUM for two different segmentation tasks.

Dataset | υ | Labeled | Unlabeled | DSC | JAC | HD | ASD
LASC | 1 | 8 | 72 | 0.8615 | 0.7586 | 16.4457 | 3.9698
LASC | 2 | 8 | 72 | 0.8724 | 0.7753 | 14.4020 | 3.7612
LASC | 3 | 8 | 72 | 0.8631 | 0.7623 | 14.7983 | 3.7027
ACDC | 1 | 7 | 63 | 0.7229 | 0.6109 | 21.0683 | 6.5155
ACDC | 2 | 7 | 63 | 0.7429 | 0.6237 | 25.2195 | 7.3287
ACDC | 3 | 7 | 63 | 0.7297 | 0.6142 | 25.6428 | 7.4772
Table 6. Performance of the PE-MT when setting different values for the parameter β in the pEMA for two different segmentation tasks.

Dataset | β | Labeled | Unlabeled | DSC | JAC | HD | ASD
LASC | 0.005 | 8 | 72 | 0.7440 | 0.6084 | 21.5900 | 5.4993
LASC | 0.001 | 8 | 72 | 0.8729 | 0.7758 | 13.1082 | 3.8202
LASC | 0.0005 | 8 | 72 | 0.8590 | 0.7550 | 17.7567 | 4.6438
LASC | 0.0001 | 8 | 72 | 0.8630 | 0.7616 | 18.6198 | 4.5289
ACDC | 0.005 | 7 | 63 | 0.7026 | 0.5746 | 31.8241 | 11.4408
ACDC | 0.001 | 7 | 63 | 0.7538 | 0.6393 | 20.8482 | 6.1611
ACDC | 0.0005 | 7 | 63 | 0.7248 | 0.6077 | 22.9209 | 6.6283
ACDC | 0.0001 | 7 | 63 | 0.7449 | 0.6246 | 25.7077 | 7.4602
