1. Introduction
The advancements in artificial intelligence (AI), particularly neural networks using deep learning, have revolutionized high-stakes tasks such as medical analysis. For instance, natural language processing (NLP) models like GPT have been applied to diagnose health conditions via chat applications [1], and generative AI has been used to discover new drug designs [2]. Similarly, deep learning models handling medical images contribute significantly to tasks such as translating PET to CT images [3], detecting oral cancer and classifying its risk [4], and segmenting cells [5,6].
The emergence of diffusion models has further advanced image generation tasks, providing a more stable training process compared to generative adversarial networks (GANs). Trained diffusion models, such as Stable Diffusion [7], have demonstrated the ability to generate high-quality outputs, which opens new possibilities for medical applications.
In the context of medical tasks utilizing diffusion models, refs. [8,9] employed denoising diffusion probabilistic models (DDPMs) to perform organ segmentation and coloring from reference images to MRI input images. Furthermore, Konz et al. [10] proposed SegGuidedDiff, which uses the denoising diffusion implicit model (DDIM) algorithm to generate realistic MRI and CT images from input mask images. To enhance image quality further, researchers have introduced more complex denoising modules, such as frequency filters [8] and advanced multi-head attention mechanisms [11]. However, these sophisticated diffusion models often incur high computational costs during training and require substantial GPU memory for storage. Such resource-intensive models are challenging to implement in practical medical settings, where hardware constraints often limit the feasibility of deploying large-scale models.
In this study, we employed the DDIM method due to its potential for fast and high-quality sampling. Our goal was to address the limitations of existing diffusion models by developing a lightweight DDIM capable of delivering comparable segmentation quality while reducing computational requirements. Through iterative experimentation, we arrived at a simple yet effective solution: replacing standard convolutional neural networks (CNNs) in each layer with grouped CNNs.
We evaluated our lightweight DDIM on three distinct medical segmentation tasks spanning different imaging modalities: (1) extracting lung regions from chest X-ray images, (2) segmenting melanoma from skin images, and (3) segmenting polyps from endoscopy images.
The contributions of our study can be summarized as follows:
By implementing grouped CNNs in residual blocks and self-attention layers, the trained model achieved segmentation results closely matching ground truth data.
In terms of quality metrics, our lightweight DDIM demonstrated high recall and dice scores, comparable to those of standard DDIM models. In addition, the runtime required to output a predicted segmentation image was 29.9% shorter than that of the standard DDIM.
The model’s file size was reduced by approximately 75% compared to the standard DDIM, thus significantly lowering storage requirements.
3. Our Lightweight DDIM
3.1. Overview
Diffusion models have demonstrated potential in medical segmentation tasks due to their ability to reconstruct high-quality outputs through iterative denoising processes. To explore this potential, we implemented a denoising model based on U-Net within the DDIM framework.
Figure 1 illustrates the overall architecture of our proposed model, which consists of three main components: an encoder, a bottleneck, and a decoder.
In the encoder, concatenated input images are progressively downsampled through residual blocks and convolutional layers to extract hierarchical features. At each stage, feature maps are stored for concatenation in the corresponding decoder layers. The bottleneck includes a self-attention layer, which processes the feature maps from the final residual block to capture long-range dependencies. In the decoder, the stored feature maps are upsampled through multiple residual blocks, and the predicted noise is output, which is subsequently used in the reverse process to reconstruct the final segmentation.
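To make this data flow concrete, the following is a minimal PyTorch sketch of the encoder–bottleneck–decoder path described above. It is an illustrative reading of Figure 1, not the authors' implementation: the channel sizes, the single downsampling stage, and the use of a standard multi-head attention layer are placeholder assumptions, and the timestep embedding discussed in Section 3.2 is omitted for brevity.

```python
import torch
import torch.nn as nn

class DenoisingUNetSketch(nn.Module):
    """Sketch of the conditional denoising U-Net: encoder, attention bottleneck, decoder."""

    def __init__(self, in_ch=2, base_ch=64):  # in_ch = noisy input x_t + conditional image c
        super().__init__()
        self.enc = nn.Conv2d(in_ch, base_ch, 3, padding=1)                # encoder stage
        self.down = nn.Conv2d(base_ch, base_ch * 2, 3, stride=2, padding=1)
        self.attn = nn.MultiheadAttention(base_ch * 2, num_heads=4, batch_first=True)
        self.up = nn.ConvTranspose2d(base_ch * 2, base_ch, 2, stride=2)   # decoder stage
        self.out = nn.Conv2d(base_ch * 2, 1, 3, padding=1)                # predicted noise

    def forward(self, x_t, c):
        h = torch.cat([x_t, c], dim=1)          # concatenated input images
        skip = self.enc(h)                      # stored for the corresponding decoder layer
        b = self.down(skip)                     # downsampling to the bottleneck
        n, ch, hh, ww = b.shape
        seq = b.flatten(2).transpose(1, 2)      # (N, H*W, C) tokens for self-attention
        a, _ = self.attn(seq, seq, seq)         # long-range dependencies in the bottleneck
        b = a.transpose(1, 2).reshape(n, ch, hh, ww)
        d = self.up(b)                          # upsampling in the decoder
        d = torch.cat([d, skip], dim=1)         # concatenation with stored encoder features
        return self.out(d)                      # output: predicted noise
```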
The noisy input $x_t$ is generated by blending random noise $\epsilon$ and the target segmentation image $x_0$, as described in Equation (1):

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \qquad (1)$$

where $\bar{\alpha}_t$ denotes the cumulative product of the noise schedule up to timestep $t$.
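As a concrete illustration of Equation (1), the sketch below forms $x_t$ from $x_0$ and random noise; the linear beta schedule and the value of $T$ are placeholder assumptions, not the settings used in our experiments.

```python
import torch

def q_sample(x0, t, alpha_bar, eps=None):
    """x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps  (Equation (1))."""
    if eps is None:
        eps = torch.randn_like(x0)                  # random noise epsilon
    a_bar = alpha_bar[t].view(-1, 1, 1, 1)          # cumulative schedule value at timestep t
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps, eps

# Example: a linear beta schedule over T = 1000 timesteps (illustrative values)
T = 1000
betas = torch.linspace(1e-4, 2e-2, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)
```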
As the timestep $t$ increases, $x_t$ becomes progressively noisier. The noisy input $x_t$, along with the conditional input $c$, is fed into the denoising U-Net $\epsilon_\theta$. During training, $\epsilon_\theta$ minimizes the mean squared error (MSE) loss between $\epsilon$ and $\epsilon_\theta(x_t, c, t)$, as shown in Equation (3):

$$\mathcal{L}_{\mathrm{MSE}} = \mathbb{E}_{x_0, c, \epsilon, t}\left[\, \left\| \epsilon - \epsilon_\theta(x_t, c, t) \right\|^2 \,\right]. \qquad (3)$$
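A minimal training-step sketch of this objective is shown below; `model` stands for the conditional denoising U-Net $\epsilon_\theta$ and is assumed to accept $(x_t, c, t)$, the forward noising of Equation (1) is repeated inline, and all tensors are assumed to reside on the same device.

```python
import torch
import torch.nn.functional as F

def training_step(model, x0, c, alpha_bar):
    """One optimization step of the MSE objective in Equation (3)."""
    t = torch.randint(0, alpha_bar.numel(), (x0.size(0),), device=x0.device)
    eps = torch.randn_like(x0)                             # target noise
    a_bar = alpha_bar[t].view(-1, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps   # Equation (1)
    eps_pred = model(x_t, c, t)                            # eps_theta(x_t, c, t)
    return F.mse_loss(eps_pred, eps)                       # Equation (3)
```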
In the reverse process, segmentation images are reconstructed iteratively from noise. The sampling process is governed by Equation (4), which follows the DDIM framework:

$$x_{t-1} = \sqrt{\bar{\alpha}_{t-1}} \left( \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon_\theta(x_t, c, t)}{\sqrt{\bar{\alpha}_t}} \right) + \sqrt{1 - \bar{\alpha}_{t-1}}\, \epsilon_\theta(x_t, c, t), \qquad (4)$$

where $\left(x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon_\theta(x_t, c, t)\right) / \sqrt{\bar{\alpha}_t}$ in the first term represents the prediction of $x_0$, and the second term captures the residual noise.
Figure 2 shows the reverse process, which generates a segmentation from an input of random noise. Through iterative steps over the timesteps $t, t - N, t - 2N, \ldots$ (where $N$ denotes the skipping interval), the segmentation image is progressively reconstructed from the predicted noise. Unlike DDPM, which relies on Markov chains for sampling, DDIM uses a deterministic sampling procedure, allowing for consistent segmentation predictions even with varying random noise inputs. Therefore, the generation of the predicted segmentation can be accelerated by setting a large value of $N$.
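The deterministic reverse process with a skipping interval $N$ can be sketched as follows; this is an illustrative reading of Equation (4), and the default values of `T`, `N`, and the output shape are placeholders rather than our experimental settings.

```python
import torch

@torch.no_grad()
def ddim_sample(model, c, alpha_bar, T=1000, N=50, shape=(1, 1, 256, 256)):
    """Reconstruct a segmentation from random noise via the DDIM update of Equation (4)."""
    x = torch.randn(shape, device=c.device)              # start from random noise
    timesteps = list(range(T - 1, -1, -N))               # visit t, t - N, t - 2N, ...
    for i, t in enumerate(timesteps):
        t_prev = timesteps[i + 1] if i + 1 < len(timesteps) else -1
        a_t = alpha_bar[t]                                # alpha_bar assumed on the same device as c
        a_prev = alpha_bar[t_prev] if t_prev >= 0 else torch.ones_like(a_t)
        t_batch = torch.full((shape[0],), t, device=c.device, dtype=torch.long)
        eps = model(x, c, t_batch)                                   # predicted noise
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()          # prediction of x_0
        x = a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps      # Equation (4)
    return x
```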
3.2. Our Lightweight Residual Block and Self-Attention
Traditional diffusion models employ numerous residual blocks to address the gradient vanishing problem [24]. However, the extensive use of these blocks, coupled with standard self-attention mechanisms, significantly increases the computational cost and memory requirements of the model.
To mitigate these issues, we introduce lightweight grouped convolution layers. Grouped convolutions divide input feature maps into smaller groups based on channel size and perform convolutions independently within each group, reducing computational overhead and memory usage [25].
Figure 3 illustrates the architecture of our modified residual blocks and self-attention layers.
In our residual blocks, input feature maps with dimensionality $D$ are processed through grouped convolutions, with the group parameter set to $\min(32, D)$, i.e., the smaller value between 32 and $D$. Embedded time data is injected into intermediate layers to provide temporal context during training.
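A minimal sketch of such a residual block is given below; it assumes $D$ is a multiple of the group size (as is typical for U-Net channel widths), and the SiLU activation and linear time projection are illustrative choices rather than the exact configuration of our model.

```python
import torch
import torch.nn as nn

class GroupedResBlock(nn.Module):
    """Residual block whose convolutions use groups = min(32, D)."""

    def __init__(self, D, time_dim):
        super().__init__()
        g = min(32, D)                                       # group parameter
        self.conv1 = nn.Conv2d(D, D, 3, padding=1, groups=g)
        self.conv2 = nn.Conv2d(D, D, 3, padding=1, groups=g)
        self.time_proj = nn.Linear(time_dim, D)              # injects embedded time data
        self.act = nn.SiLU()

    def forward(self, x, t_emb):
        h = self.act(self.conv1(x))
        h = h + self.time_proj(t_emb)[:, :, None, None]      # temporal context
        h = self.act(self.conv2(h))
        return x + h                                         # residual connection
```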
For self-attention, the Q (query), K (key), and V (value) matrices are computed using grouped convolutional layers. The dimensionality of each matrix is set to $D$. The grouped convolution parameter for calculating K and V is fixed at 32, further enhancing computational efficiency while preserving the effectiveness of the attention mechanism.
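The corresponding self-attention layer can be sketched as follows; the single-head formulation, the grouping of the Q projection, and the output projection are assumptions made for illustration, with the K and V projections grouped by the fixed parameter of 32 as described above (this requires $D$ to be divisible by 32).

```python
import torch
import torch.nn as nn

class GroupedSelfAttention(nn.Module):
    """Position-based self-attention with Q, K, V computed by grouped 1x1 convolutions."""

    def __init__(self, D):
        super().__init__()
        self.q = nn.Conv2d(D, D, 1, groups=min(32, D))   # grouping of Q: an assumption
        self.k = nn.Conv2d(D, D, 1, groups=32)           # fixed group parameter for K
        self.v = nn.Conv2d(D, D, 1, groups=32)           # fixed group parameter for V
        self.proj = nn.Conv2d(D, D, 1)

    def forward(self, x):
        n, d, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)          # (N, HW, D)
        k = self.k(x).flatten(2)                          # (N, D, HW)
        v = self.v(x).flatten(2).transpose(1, 2)          # (N, HW, D)
        attn = torch.softmax(q @ k / d ** 0.5, dim=-1)    # attention over spatial positions
        out = (attn @ v).transpose(1, 2).reshape(n, d, h, w)
        return x + self.proj(out)                         # residual connection
```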
5. Results
5.1. Visualizing Predicted Segmentation
To evaluate the segmentation quality of the diffusion models visually, we fed five samples of random noise into each model and obtained five candidate segmentations.
The segmentation results for the RSUA dataset are shown in Figure 4a. U-Net demonstrated basic segmentation capabilities, but its predictions often deviated significantly from the ground truth. MedSegDiff, leveraging its diffusion-based architecture, exhibited improved performance in delineating these regions; however, noise occasionally appeared in its outputs, highlighting stability issues. In contrast, both DDIM and Lightweight DDIM consistently achieved accurate and smooth segmentation results, closely matching the ground truth. Our Lightweight DDIM maintained performance similar to DDIM despite its reduced computational complexity, proving its efficiency in handling grayscale medical imaging tasks.
For the ISBI 2016 dataset, the segmentation results, as visualized in Figure 4b, further emphasize the strengths of diffusion-based models. U-Net had difficulty adapting to input images with strong coloration, often producing segmentation maps that failed to align with the target melanoma regions. MedSegDiff provided predictions that generally adhered to the target features; however, it showed instability when handling large melanoma regions. By contrast, DDIM demonstrated consistently precise segmentation, even under challenging conditions involving high noise during the reverse process. Similarly, Lightweight DDIM produced outputs closely aligned with the ground truth, preserving spatial accuracy and shape consistency.
Segmentation results for the PolypDB dataset, depicted in Figure 4c, reflect the challenges of localizing multiple small or intricate target regions within endoscopic images. U-Net produced coarse and often imprecise segmentation maps, failing to capture smaller polyp structures. MedSegDiff exhibited a high degree of instability, with many outputs misaligned with the target regions. On the other hand, DDIM effectively and accurately segmented both large and small regions, aligning closely with the ground truth. Lightweight DDIM, while simplified, retained its ability to handle intricate segmentation tasks, matching the ground truth effectively in most cases.
Overall, visualization of the results confirmed that the diffusion-based approaches, particularly DDIM and its lightweight variant, could perform well in medical segmentation tasks. Lightweight DDIM, despite its reduced computational demands, maintained comparable accuracy to DDIM.
5.2. Score Results: Recall and Dice Score
When outputting segmentations with the diffusion models, we fed five samples of random noise as input and used the ensemble of the resulting outputs for evaluation. In addition, we normalized each predicted segmentation to the range [0, 1] using min-max normalization. For evaluation, a threshold of 0.9 was applied per pixel in the normalized output, where a pixel was considered part of the segmented region if its value exceeded this threshold. We then calculated the recall, defined as $\mathrm{Recall} = \mathrm{TP} / (\mathrm{TP} + \mathrm{FN})$, and the dice score, defined as $\mathrm{Dice} = 2\,\mathrm{TP} / (2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN})$, where true positives (TPs) are the regions where the predicted segmentation correctly overlapped with the target; false negatives (FNs) are the regions where the target is present but was not detected by the predicted segmentation; and false positives (FPs) are the regions where the predicted segmentation identified a region as positive but no corresponding target exists. A pixel was counted as a TP if its value in the predicted segmentation exceeded the threshold and its corresponding pixel in the target segmentation was also positive.
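The sketch below illustrates this evaluation protocol; averaging is assumed as the ensembling operation over the five predicted outputs, and the small epsilon terms only guard against division by zero.

```python
import numpy as np

def recall_and_dice(preds, target, threshold=0.9):
    """Ensemble, normalize, threshold, and score a set of predicted segmentations."""
    ens = np.mean(preds, axis=0)                                # ensemble of the 5 outputs (assumed mean)
    ens = (ens - ens.min()) / (ens.max() - ens.min() + 1e-8)    # min-max normalization to [0, 1]
    pred_mask = ens > threshold                                 # per-pixel threshold of 0.9
    tgt_mask = target > 0
    tp = np.logical_and(pred_mask, tgt_mask).sum()              # true positives
    fp = np.logical_and(pred_mask, ~tgt_mask).sum()             # false positives
    fn = np.logical_and(~pred_mask, tgt_mask).sum()             # false negatives
    recall = tp / (tp + fn + 1e-8)
    dice = 2 * tp / (2 * tp + fp + fn + 1e-8)
    return recall, dice
```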
Table 1 presents the evaluation results for the ISBI 2016 and PolypDB datasets across the trained models, showcasing their performance in terms of recall and dice scores. Notably, the proposed Lightweight DDIM model demonstrated performance that closely approximated the standard DDIM while requiring significantly fewer computational resources.
For the ISBI 2016 dataset, which features centrally located segmentations, the standard DDIM achieved a dice score exceeding 91% with a recall score surpassing 90%. Lightweight DDIM, while slightly trailing at 88.16% in dice score and 84.50% in recall score, exhibited competitive results. Its ability to balance segmentation accuracy with reduced computational demands marks a significant advancement.
In the PolypDB dataset, which is characterized by diverse and intricate target regions, the standard DDIM achieved superior scores, with dice and recall values of 81.52% and 76.82%, respectively. Lightweight DDIM also performed commendably, recording dice and recall scores of 70.78% and 63.45%. This highlights its capability to handle challenging tasks, despite its simplified architecture.
Overall, the Lightweight DDIM model consistently delivered segmentation results on par with the standard DDIM, proving its practicality for resource-constrained environments. Its performance, as evident in Table 1, underscores its potential as a viable alternative, especially in applications where computational efficiency is critical.
5.3. The Memory Size and Time to Output
We compared the computational costs, file storage requirements, and inference times of the models. The results are shown in Table 2.
While the standard DDIM requires significant GPU memory and computational overhead due to its extensive use of multiple residual blocks and deeper network layers, Lightweight DDIM demonstrates remarkable improvements in resource efficiency. The GPU memory usage of Lightweight DDIM is 88.3 MB, which is 89.9% smaller than the 871.2 MB required by the standard DDIM. Similarly, the file size of Lightweight DDIM is reduced to 84.6 MB, achieving an 80.5% reduction compared to the 433 MB required by DDIM.
In addition to these reductions, Lightweight DDIM significantly optimizes inference time, requiring only 5.80 s to generate an image, which is 29.9% faster than the 8.27 s needed by DDIM. These results demonstrate that Lightweight DDIM not only maintains segmentation performance but also substantially reduces resource demands, making it a more practical choice for applications in resource-constrained medical environments.
6. Discussion
6.1. Highlighting Our Lightweight DDIM Through Experiments
This study focused on evaluating segmentation performance across various types of medical images using models such as the DDPM-based MedSegDiff, DDIM, and the proposed Lightweight DDIM. Our experiments revealed that both DDIM and Lightweight DDIM consistently generated stable segmentations, even when five different samples of random noise were used as input. In contrast, the MedSegDiff model, which is based on a DDPM framework, often produced segmentations with significant variability, especially when tested on the PolypDB dataset.
The instability in the DDPM-based model’s segmentations can be attributed to its reverse process. This process relies on a Markov chain, where each timestep’s prediction influences the next. Moreover, noise is deliberately added at every step to introduce diversity. While this design enables stochastic outputs, it also leads to inconsistent and irregular segmentations, which are less desirable for medical imaging tasks.
On the other hand, DDIM-based models employ a non-Markovian reverse process, where noise is only used during the initial input phase and does not affect subsequent predictions. This deterministic approach allows the reverse process to consistently aim at reconstructing $x_0$, guided by the direction of the predicted noise $\epsilon_\theta(x_t, c, t)$.
As a result, DDIM-based models can generate more stable and shape-consistent segmentations, aligning closely with ground truth targets.
In medical imaging, where precision and consistency are paramount, models that produce reliable and reproducible segmentations are preferred. Therefore, DDIM-based models are inherently more suited for medical segmentation tasks. Furthermore, our proposed Lightweight DDIM not only inherits these advantages but also significantly reduces file size and computational costs compared to the standard DDIM. This efficiency makes it an ideal solution for deployment in resource-constrained environments, such as rural clinics or portable medical devices.
6.2. Limitations of Our Lightweight DDIM
Despite the promising results, our proposed Lightweight DDIM has some limitations. For instance, in certain cases it produces less stable segmentation outputs than the standard DDIM, particularly when tested on the PolypDB dataset. Figure 5 shows cases in which the predictions obtained using our Lightweight DDIM deviated from the ground truth.
The instability observed with PolypDB may be attributed to the dataset’s inherent complexity. Unlike RSUA and ISBI 2016, where consistent features such as uniform lung shapes or centrally aligned melanoma regions exist, PolypDB includes diverse disease region shapes, positions, and imaging conditions. Some images exhibit dark backgrounds, while others are affected by overexposure due to camera flash, leading to inconsistent contextual information. This high variability poses challenges for segmentation models.
Additionally, the use of grouped convolutional neural networks (Grouped CNNs) in Lightweight DDIM may contribute to these issues. Grouped CNNs divide input feature maps into smaller groups for independent convolutional operations. While this approach reduces computational costs, it may fail to effectively capture the global spatial information necessary for accurately segmenting complex regions.
Moreover, the current self-attention mechanism in our Lightweight DDIM focuses solely on position-based attention. Incorporating a dual-attention mechanism could enhance the model’s ability to capture intricate relationships across both spatial and channel dimensions [30]. This improvement might address the segmentation challenges posed by datasets with high variability, such as PolypDB.
7. Conclusions
In this study, we demonstrated the significant potential of our Lightweight DDIM model in extracting diseased regions through the segmentation of both external and internal medical images. By introducing a lightweight architecture that employs grouped convolutional layers and a simplified residual structure, the proposed Lightweight DDIM achieved comparable segmentation accuracy to the standard DDIM while significantly reducing computational costs and file sizes. Notably, our Lightweight DDIM produces consistent and reliable segmentations across datasets, even under resource-constrained conditions, highlighting its adaptability for real-world medical applications.
The experimental results showcase the superior performance of our model in terms of dice and recall scores when compared to other state-of-the-art segmentation models, such as U-Net and MedSegDiff. Additionally, the model’s ability to maintain segmentation stability, even with noise variations during the reverse process, underscores its robustness and suitability for diverse medical imaging scenarios. The substantial reduction in GPU memory usage and inference time achieved by the Lightweight DDIM further emphasizes its practicality for deployment in environments with limited computational resources.
However, the reliance on paired datasets consisting of raw medical images and their corresponding segmentation labels presents a notable limitation. The process of collecting such datasets is both time-intensive and expertise-driven, often requiring manual annotations by medical professionals. Addressing this challenge is critical for expanding the applicability of Lightweight DDIM. Developing an architecture capable of learning from unpaired datasets would represent a significant advancement, reducing dependency on paired data while enhancing training efficiency and flexibility.
Future research will focus on adapting the Lightweight DDIM for unpaired segmentation tasks by leveraging techniques from unpaired image-to-image translation. CycleGAN, which employs adversarial and cycle consistency losses to retain target domain features while ensuring fidelity between source and reconstructed images, provides an inspiring foundation. Similarly, UNIT-DDPM, which integrates diffusion probabilistic models with consistency loss for handling random noise, offers promising insights for unpaired training [31]. Building upon these approaches, we aim to extend Lightweight DDIM to learn effectively from unpaired datasets, enabling its application to a broader range of medical imaging tasks.
Furthermore, improving the model’s robustness to blurry or low-resolution input images is crucial for ensuring reliable segmentation performance in diverse clinical scenarios. Medical images often suffer from degradation due to factors such as motion artifacts, imaging noise, and limited hardware resolution, which can significantly impact segmentation accuracy. To address this, previous research has explored diffusion-based image restoration techniques, such as Real-SRGD (real-world image super-resolution with classifier-free guided diffusion) [32], which employs classifier-free guidance to reduce image degradation, effectively restoring high-frequency details in low-quality images. By integrating a pre-trained Real-SRGD model as a preprocessing step before segmentation, we can enhance input image quality, thereby improving the accuracy and consistency of our Lightweight DDIM model. Future work will explore the optimal integration of Real-SRGD with our segmentation framework, assessing its impact on both computational efficiency and segmentation performance.
Additionally, enhancing the model’s robustness against noise and adversarial attacks is essential for ensuring reliability in real-world medical applications. Adversarial examples, which introduce small but carefully crafted perturbations to input images, can significantly degrade the performance of deep learning models by leading them to incorrect predictions [33]. In medical imaging, such perturbations could result in critical misdiagnoses, making it imperative to develop defense mechanisms against them. Future improvements will focus on integrating adversarial training techniques and noise-resistant architectures, such as denoising diffusion models, to strengthen Lightweight DDIM’s resistance to perturbations. Additionally, exploring self-supervised learning and contrastive learning approaches could further enhance the model’s ability to generalize under noisy and adversarial conditions.
By addressing the current limitations and incorporating advanced learning strategies, Lightweight DDIM has the potential to become a transformative tool in medical image segmentation. Its lightweight and efficient design positions it as a viable solution for widespread use, even in settings with limited computational resources, while paving the way for more accessible and scalable medical imaging solutions in the future.